Subset data#
Subsetting cells or samples to create smaller, representative datasets for benchmarking and testing.
This module supports common subsetting strategies such as sampling within samples (per-library) or selecting full samples/datasets.
Environments#
Install only the environments you need. For this module you need:
Configuration#
Example dataset configuration (keys match the implementation and test config):
DATASETS:
by_sample_subset:
input:
subset: path/to/input.h5ad
subset:
sample_key: sample
seed: 42
n_cells: 1000
strategy: by_sample
within_sample_subset:
input:
subset: path/to/input.h5ad
subset:
sample_key: sample
seed: 42
n_cells: 1000
per_sample: 100
strategy: within_sample
Notes about parameter names and mapping:
sample_key: the key inadata.obsthat defines the samples/libraries to subset over.per_sample: number of cells to keep per sample when usingwithin_samplestrategy.seed: random seed for reproducible subsetting.strategy: must match one of the implemented strategies:by_sampleorwithin_sample.
Strategies#
by_sample#
Selects entire samples (libraries) and keeps adding shuffled samples until the
n_cells(n_cell_max) is reached.Samples smaller than
min_cells_per_sample(default 100) are skipped.Useful for testing behavior across full-sample composition and batch effects.
within_sample#
Samples up to
per_sample(n_cells_per_sample) cells per sample. Ifper_sampleis not set, the code computes floor(n_cells / n_samples).Samples are shuffled to avoid systematic bias; sampling is done without replacement.
If
n_cellsis None the function will default to keeping all cells (no downsampling).
Testing#
A test config is included under workflow/subset/test/config.yaml showing usage patterns for both strategies. Example test commands used by the project:
conda activate snakemake
bash test/run_test.sh -n # dry run
bash test/run_test.sh -c2 # actual run with max 2 cores
Output#
Updated AnnData and summary files:
<out_dir>/subset/dataset~<datasets>/file_id~<file_id>.zarr— AnnData with subsetting metadata and reduced observation set (for zarr inputs a reference/linking strategy is attempted when possible).adata.obs[‘subset’] — boolean mask (True for retained cells) set by the subsetting functions.
adata.uns[‘subset’] — dictionary containing the subsetting strategy and parameters used (e.g., strategy, n_cells, per_sample/n_cells_per_sample, seed/min_cells_per_sample).
Note: For zarr input stores the pipeline will attempt to reference or link original arrays to avoid copying large arrays where supported by the filesystem/backend. If linking is unsupported some arrays may be copied.