Preprocessing#
This module runs a configurable single-cell preprocessing workflow and assembles selected results into one output zarr file.
Implemented steps:
- Normalize (normalize.py)
- Mark non-zero genes for HVG-safe filtering (filter_genes.py)
- Highly variable genes (highly_variable_genes.py)
- Optional extra HVGs (extra_hvgs.py)
- PCA (pca.py)
- Neighbors graph (neighbors.py)
- UMAP (umap.py)
- PCA/UMAP plots (plot.py)
- Assembly (assemble.py)
Rules are declared in rules/rules.smk and parameterized in rules/assemble.smk; plotting rules are in rules/plots.smk.
Inputs, outputs, and execution#
- The module reads dataset inputs from DATASETS.<dataset>.input.preprocessing.
- Inputs may be .h5ad or .zarr.
- Rule outputs are written as .zarr.
- The default target (rule all in Snakefile) runs:
  - assembly output zarr files
  - PCA and UMAP plots
The assembled output path follows the module parameter space and is written under the configured output directory.
Global configuration#
Common knobs (global defaults, overridable per step):
- dask
- n_threads
- resources
- step-specific argument dictionaries (normalize, filter, highly_variable_genes, extra_hvgs, pca, neighbors, umap)
Example:
preprocessing:
  dask: true
  n_threads: 10
  resources: gpu
Step details#
Normalize#
Script: scripts/normalize.py
Behavior:
- Reads raw counts from raw_counts (default X)
- Ensures sparse representation
- Runs scanpy.pp.normalize_total and scanpy.pp.log1p
- Stores normalized matrix in both .X and .layers["normcounts"]
- Preserves raw counts in .layers["counts"] and .raw
- Writes metadata under .uns["preprocessing"] and .uns["log1p"]
Important params:
- raw_counts
- gene_id_column
- normalize (args passed to scanpy.pp.normalize_total)
- dask
Example:
DATASETS:
  dataset_name:
    preprocessing:
      raw_counts: X
      gene_id_column: gene_id
      normalize:
        target_sum: 1e4
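The arithmetic behind the two scanpy calls can be illustrated on a toy matrix. The sketch below is not the module's code: it uses plain NumPy on a dense array, whereas the script works on sparse data through scanpy, and the helper name `normalize_and_log` is hypothetical.

```python
import numpy as np

def normalize_and_log(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True).astype(float)
    # Guard against division by zero: all-zero cells stay all-zero.
    totals[totals == 0] = 1.0
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

# Three cells, three genes; the second cell is empty.
counts = np.array([[1, 3, 0], [0, 0, 0], [5, 5, 10]])
normed = normalize_and_log(counts, target_sum=1e4)

# Undo the log to confirm that non-empty cells now sum to target_sum.
row_sums = np.expm1(normed).sum(axis=1)
```

Keeping the raw counts in a separate layer, as the step does, means later steps can still reach the unnormalized values.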
Filter genes#
Script: scripts/filter_genes.py
Behavior:
- Computes .var["nonzero_genes"] using _filter_genes.
- This is a marker step used by HVG/extra-HVG steps; it does not shrink the final feature space by itself.
Important params:
- filter (default contains min_cells: 1)
- dask
Example:
DATASETS:
  dataset_name:
    preprocessing:
      filter:
        min_cells: 3
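To make the "mark, don't drop" idea concrete, here is a minimal NumPy sketch of a per-gene marker based on `min_cells`. The function `nonzero_gene_mask` is a hypothetical stand-in for whatever `_filter_genes` does internally; only the thresholding logic is shown.

```python
import numpy as np

def nonzero_gene_mask(X: np.ndarray, min_cells: int = 1) -> np.ndarray:
    """Boolean mask over genes (columns) detected in at least `min_cells` cells."""
    cells_per_gene = (X > 0).sum(axis=0)
    return cells_per_gene >= min_cells

# Three cells (rows) by three genes (columns); gene 0 is never expressed.
X = np.array([
    [0, 2, 1],
    [0, 0, 4],
    [0, 1, 3],
])

mask_default = nonzero_gene_mask(X)               # min_cells=1
mask_strict = nonzero_gene_mask(X, min_cells=3)   # gene must appear in all cells

# X keeps all of its columns: the mask is only an annotation for later steps.
```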
Highly variable genes#
Script: scripts/highly_variable_genes.py
Behavior:
- Runs HVG selection on filtered cells/genes.
- Maps HVG results back to the full .var.
- Always provides highly_variable; additionally writes a parameterized variant column highly_variable-... when args are provided.
- Stores run args in .uns["preprocessing"]["highly_variable_genes"].
Notes:
- subset is explicitly removed from args in the script.
- If args are False, all genes are marked as highly variable.
Example:
DATASETS:
  dataset_name:
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
        flavor: seurat_v3
        batch_key: sample
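The map-back and the two notes above can be sketched as follows. This is an illustration under assumptions, not the script itself: `map_hvg_to_full` and `drop_subset` are hypothetical helpers, and the HVG result on the filtered genes is passed in as a ready-made boolean list rather than computed with scanpy.

```python
import numpy as np

def drop_subset(args: dict) -> dict:
    """Remove `subset` so HVG selection never shrinks the gene axis."""
    return {k: v for k, v in args.items() if k != "subset"}

def map_hvg_to_full(n_genes, kept_idx, hvg_on_kept, args):
    """Expand an HVG mask computed on a gene subset to the full gene axis."""
    if args is False:
        # Documented behaviour: no selection, every gene is marked as HVG.
        return np.ones(n_genes, dtype=bool)
    full = np.zeros(n_genes, dtype=bool)
    # Genes that were filtered out before HVG selection stay False.
    full[np.asarray(kept_idx, dtype=int)] = np.asarray(hvg_on_kept, dtype=bool)
    return full

clean_args = drop_subset({"n_top_genes": 2000, "subset": True})

# Five genes in total; HVG selection only saw genes 1, 2 and 4.
full_mask = map_hvg_to_full(5, kept_idx=[1, 2, 4],
                            hvg_on_kept=[True, False, True],
                            args=clean_args)
all_mask = map_hvg_to_full(5, kept_idx=[], hvg_on_kept=[], args=False)
```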
Extra HVGs#
Script: scripts/extra_hvgs.py
Behavior:
- Computes an additional HVG mask in extra_hvgs (or extra_hvgs-... when overwrite_args is used).
- Supports:
  - union of per-group HVGs via union_over
  - adding genes via extra_genes
  - removing genes via remove_genes
- Stores metadata in .uns["preprocessing"][<extra_hvg_column>].
Example:
DATASETS:
  dataset_name:
    preprocessing:
      extra_hvgs:
        union_over: [lineage]
        extra_genes: [CCR7, PTPRC]
        remove_genes: [MALAT1]
        min_cells: 200
        overwrite_args:
          n_top_genes: 3000
          flavor: seurat_v3
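The three supported operations amount to set arithmetic on gene names. The sketch below assumes the per-group HVG lists are already available and that removal is applied last; both the helper `combine_extra_hvgs` and that ordering are assumptions, not confirmed details of extra_hvgs.py.

```python
def combine_extra_hvgs(per_group_hvgs, extra_genes=(), remove_genes=()):
    """Union per-group HVGs, add `extra_genes`, then drop `remove_genes`."""
    selected = set()
    for genes in per_group_hvgs.values():
        selected |= set(genes)        # union over groups (cf. union_over)
    selected |= set(extra_genes)      # force-include genes
    selected -= set(remove_genes)     # assumed to be applied last
    return sorted(selected)

# Made-up HVG lists for two groups of a `lineage`-like column.
per_group = {
    "T": ["CD3E", "IL7R"],
    "B": ["CD79A", "MALAT1"],
}
extra_hvgs = combine_extra_hvgs(
    per_group,
    extra_genes=["CCR7", "PTPRC"],
    remove_genes=["MALAT1"],
)
```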
PCA#
Script: scripts/pca.py
Behavior:
- Subsets to HVGs using mask_var (default highly_variable).
- Optionally scales before PCA (scale).
- Writes .obsm["X_pca"], .uns["pca"], and .varm loadings.
- Stores preprocessing metadata in .uns["preprocessing"]["pca"] and ...["scaled"].
Example:
DATASETS:
  dataset_name:
    preprocessing:
      scale: true
      pca:
        n_comps: 50
        svd_solver: covariance_eigh
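As a rough picture of "subset to a mask, optionally scale, then decompose", the following NumPy sketch runs PCA via SVD on the masked genes and pads the loadings back to the full gene axis. It is a simplified stand-in, assuming a dense matrix; the function `pca_on_mask` is hypothetical and the module itself relies on scanpy.

```python
import numpy as np

def pca_on_mask(X, mask_var, n_comps=2, scale=False):
    """PCA restricted to genes where `mask_var` is True."""
    Xs = X[:, mask_var].astype(float)
    Xs = Xs - Xs.mean(axis=0)              # centre each gene
    if scale:
        std = Xs.std(axis=0, ddof=1)
        std[std == 0] = 1.0                # leave constant genes untouched
        Xs = Xs / std                      # unit variance per gene
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_comps] * S[:n_comps]  # cell embeddings (cf. X_pca)
    # Loadings on the full gene axis; genes outside the mask get zeros.
    loadings = np.zeros((X.shape[1], n_comps))
    loadings[mask_var] = Vt[:n_comps].T
    return scores, loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
mask = np.array([True, True, False, True, False, True])
scores, loadings = pca_on_mask(X, mask, n_comps=2, scale=True)
```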
Neighbors#
Script: scripts/neighbors.py
Behavior:
- Computes or reuses a neighbor graph.
- If neighbors params are False, reuses the existing graph from the input.
- Defaults to use_rep="X_pca" when available, otherwise X.
- Writes .obsp["distances"], .obsp["connectivities"], and .uns["neighbors"].
Example:
DATASETS:
  dataset_name:
    preprocessing:
      neighbors:
        n_neighbors: 15
        metric: cosine
        use_rep: X_pca
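The compute-or-reuse decision and the `use_rep` default described above can be sketched as a small planning function. The name `plan_neighbors` is hypothetical, and raising an error when no graph exists to reuse is an assumption about reasonable behaviour, not something the source states.

```python
def plan_neighbors(params, obsm_keys, has_existing_graph):
    """Return ('reuse', None) or ('compute', kwargs) for the neighbors step."""
    if params is False:
        if not has_existing_graph:
            # Assumed failure mode: nothing to reuse.
            raise ValueError("neighbors is False but the input has no graph")
        return "reuse", None
    kwargs = dict(params or {})
    # Prefer the PCA embedding when present, otherwise fall back to X.
    kwargs.setdefault("use_rep", "X_pca" if "X_pca" in obsm_keys else "X")
    return "compute", kwargs

with_pca = plan_neighbors({"n_neighbors": 15}, obsm_keys=["X_pca"], has_existing_graph=False)
without_pca = plan_neighbors({}, obsm_keys=[], has_existing_graph=False)
reused = plan_neighbors(False, obsm_keys=[], has_existing_graph=True)
```

An explicit `use_rep` in the config is left untouched, since `setdefault` only fills the key when it is absent.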
UMAP#
Script: scripts/umap.py
Behavior:
- Computes UMAP from a selected neighbors graph (neighbors_key, default neighbors).
- If the required representation is missing in the input, it is loaded from the rep input.
- Writes .obsm["X_umap"] and updated .uns.
Example:
DATASETS:
  dataset_name:
    preprocessing:
      umap:
        min_dist: 0.3
        spread: 1.0
Plots#
Script: scripts/plot.py
Rules:
- preprocessing_plot_pca: basis X_pca
- preprocessing_plot_umap: basis X_umap
Behavior:
- Plots obs columns from colors.
- Treats non-obs colors entries as genes/patterns and creates expression panels.
- Supports centroid overlays for categorical columns via plot_centroids.
- Uses plot_gene_chunk_size to chunk gene panels.
Example:
DATASETS:
  dataset_name:
    preprocessing:
      colors: [batch, cell_type, CCR7, PTPRC]
      plot_centroids: [cell_type]
      plot_gene_chunk_size: 12
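How `colors` might be split into obs columns and gene panels, and how gene panels could be chunked, is sketched below. Both helpers are hypothetical, and only exact gene-name matches are handled; the pattern matching mentioned above is left out because its syntax is not specified here.

```python
def split_colors(colors, obs_columns, var_names):
    """Separate obs columns from gene names, keeping the configured order."""
    obs_colors = [c for c in colors if c in obs_columns]
    gene_colors = [c for c in colors if c not in obs_columns and c in var_names]
    return obs_colors, gene_colors

def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

obs_colors, gene_colors = split_colors(
    ["batch", "cell_type", "CCR7", "PTPRC"],
    obs_columns={"batch", "cell_type"},
    var_names={"CCR7", "PTPRC", "MALAT1"},
)
# With a chunk size of 1, each gene would get its own figure.
gene_chunks = chunk(gene_colors, size=1)
```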
Assembly#
Script: scripts/assemble.py
The assemble list controls which step outputs are linked into the final object.
Supported assemble entries:
- normalize
- highly_variable_genes
- extra_hvgs
- pca
- neighbors
- umap
Notes:
- counts is not a standalone assembly key in the current implementation.
- For HVG-like outputs, assembly links parameterized columns (for example highly_variable-..., extra_hvgs-...) and also fills default slots (highly_variable, extra_hvgs) for the default file of each type.
- Wildcards used to build the assembled output are stored under uns/wildcards.
Example:
DATASETS:
  dataset_name:
    preprocessing:
      assemble:
        - normalize
        - highly_variable_genes
        - extra_hvgs
        - pca
        - neighbors
        - umap
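A config can be checked against the supported entries before running the workflow. The check below is a hypothetical pre-flight helper, not part of assemble.py; whether the module itself rejects unknown keys such as `counts` is not stated in this document.

```python
SUPPORTED_ASSEMBLE = (
    "normalize",
    "highly_variable_genes",
    "extra_hvgs",
    "pca",
    "neighbors",
    "umap",
)

def validate_assemble(entries):
    """Return the entries unchanged, or raise if any is not a supported key."""
    unknown = [e for e in entries if e not in SUPPORTED_ASSEMBLE]
    if unknown:
        raise ValueError(f"Unsupported assemble entries: {unknown}")
    return list(entries)

valid = validate_assemble(["normalize", "pca", "umap"])

try:
    validate_assemble(["normalize", "counts"])
    counts_rejected = False
except ValueError:
    # `counts` is not a standalone assembly key.
    counts_rejected = True
```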
Minimal config example#
DATASETS:
  dataset_name:
    input:
      preprocessing: adata.h5ad
    preprocessing:
      raw_counts: X
      dask: true
      n_threads: 10
      highly_variable_genes:
        n_top_genes: 2000
        batch_key: batch
      extra_hvgs:
        union_over: [lineage]
        extra_genes: [CCR7, PTPRC]
      pca:
        n_comps: 50
      neighbors:
        n_neighbors: 15
      umap:
        min_dist: 0.5
      colors: [batch, cell_type, CCR7]
      plot_centroids: [cell_type]
      plot_gene_chunk_size: 12
      assemble:
        - normalize
        - highly_variable_genes
        - extra_hvgs
        - pca
        - neighbors
        - umap
Notes#
GPU-enabled rules use RAPIDS via the configured GPU environment when available.
Empty datasets are explicitly handled in scripts and written as valid empty zarr outputs.