Preprocessing

Preprocessing#

This module runs a configurable single-cell preprocessing workflow and assembles selected results into one output zarr file.

Implemented steps:

Normalize (normalize.py)
Mark non-zero genes for HVG-safe filtering (filter_genes.py)
Highly variable genes (highly_variable_genes.py)
Optional extra HVGs (extra_hvgs.py)
PCA (pca.py)
Neighbors graph (neighbors.py)
UMAP (umap.py)
PCA/UMAP plots (plot.py)
Assembly (assemble.py)

Rules are declared in rules/rules.smk, parameterized in rules/assemble.smk, and plotting rules are in rules/plots.smk.

Inputs, outputs, and execution#

The module reads dataset inputs from DATASETS.<dataset>.input.preprocessing.
Inputs may be .h5ad or .zarr.
Rule outputs are written as .zarr.
The default target (rule all in Snakefile) runs:
- assembly output zarr files
- PCA and UMAP plots

The assembled output path follows the module parameter space and is written under the configured output directory.

Global configuration#

Common knobs (global defaults, overridable per step):

dask
n_threads
resources
step-specific argument dictionaries (normalize, filter, highly_variable_genes, extra_hvgs, pca, neighbors, umap)

Example:

preprocessing:
  dask: true
  n_threads: 10
  resources: gpu

Step details#

Normalize#

Script: scripts/normalize.py

Behavior:

Reads raw counts from raw_counts (default X)
Ensures sparse representation
Runs scanpy.pp.normalize_total and scanpy.pp.log1p
Stores normalized matrix in both .X and .layers["normcounts"]
Preserves raw counts in .layers["counts"] and .raw
Writes metadata under .uns["preprocessing"] and .uns["log1p"]

Important params:

raw_counts
gene_id_column
normalize (args passed to scanpy.pp.normalize_total)
dask

Example:

DATASETS:
  dataset_name:
    preprocessing:
      raw_counts: X
      gene_id_column: gene_id
      normalize:
        target_sum: 1e4

Filter genes#

Script: scripts/filter_genes.py

Behavior:

Computes .var["nonzero_genes"] using _filter_genes.
This is a marker step used by HVG/extra-HVG steps; it does not shrink the final feature space by itself.

Important params:

filter (default contains min_cells: 1)
dask

Example:

DATASETS:
  dataset_name:
    preprocessing:
      filter:
        min_cells: 3

Highly variable genes#

Script: scripts/highly_variable_genes.py

Behavior:

Runs HVG selection on filtered cells/genes.
Maps HVG results back to the full .var.
Always provides highly_variable; additionally writes a parameterized variant column highly_variable-... when args are provided.
Stores run args in .uns["preprocessing"]["highly_variable_genes"].

Notes:

subset is explicitly removed from args in the script.
If args are False, all genes are marked as highly variable.

Example:

DATASETS:
  dataset_name:
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
        flavor: seurat_v3
        batch_key: sample

Extra HVGs#

Script: scripts/extra_hvgs.py

Behavior:

Computes an additional HVG mask in extra_hvgs (or extra_hvgs-... when overwrite_args is used).
Supports:
- union of per-group HVGs via union_over
- adding genes via extra_genes
- removing genes via remove_genes
Stores metadata in .uns["preprocessing"][<extra_hvg_column>].

Example:

DATASETS:
  dataset_name:
    preprocessing:
      extra_hvgs:
        union_over: [lineage]
        extra_genes: [CCR7, PTPRC]
        remove_genes: [MALAT1]
        min_cells: 200
        overwrite_args:
          n_top_genes: 3000
          flavor: seurat_v3

PCA#

Script: scripts/pca.py

Behavior:

Subsets to HVGs using mask_var (default highly_variable).
Optionally scales before PCA (scale).
Writes .obsm["X_pca"], .uns["pca"], and .varm loadings.
Stores preprocessing metadata in .uns["preprocessing"]["pca"] and ...["scaled"].

Example:

DATASETS:
  dataset_name:
    preprocessing:
      scale: true
      pca:
        n_comps: 50
        svd_solver: covariance_eigh

Neighbors#

Script: scripts/neighbors.py

Behavior:

Computes or reuses neighbor graph.
If neighbors params are False, reuses existing graph from input.
Defaults to use_rep="X_pca" when available, otherwise X.
Writes .obsp["distances"], .obsp["connectivities"], .uns["neighbors"].

Example:

DATASETS:
  dataset_name:
    preprocessing:
      neighbors:
        n_neighbors: 15
        metric: cosine
        use_rep: X_pca

UMAP#

Script: scripts/umap.py

Behavior:

Computes UMAP from a selected neighbors graph (neighbors_key, default neighbors).
If required representation is missing in input, it is loaded from the rep input.
Writes .obsm["X_umap"] and updated .uns.

Example:

DATASETS:
  dataset_name:
    preprocessing:
      umap:
        min_dist: 0.3
        spread: 1.0

Plots#

Script: scripts/plot.py

Rules:

preprocessing_plot_pca: basis X_pca
preprocessing_plot_umap: basis X_umap

Behavior:

Plots obs columns from colors.
Treats non-obs colors entries as genes/patterns and creates expression panels.
Supports centroid overlays for categorical columns via plot_centroids.
Uses plot_gene_chunk_size to chunk gene panels.

Example:

DATASETS:
  dataset_name:
    preprocessing:
      colors: [batch, cell_type, CCR7, PTPRC]
      plot_centroids: [cell_type]
      plot_gene_chunk_size: 12

Assembly#

Script: scripts/assemble.py

The assemble list controls which step outputs are linked into the final object.

Supported assemble entries:

normalize
highly_variable_genes
extra_hvgs
pca
neighbors
umap

Notes:

counts is not a standalone assembly key in the current implementation.
For HVG-like outputs, assembly links parameterized columns (for example highly_variable-..., extra_hvgs-...) and also fills default slots (highly_variable, extra_hvgs) for the default file of each type.
Wildcards used to build the assembled output are stored under uns/wildcards.

Example:

DATASETS:
  dataset_name:
    preprocessing:
      assemble:
        - normalize
        - highly_variable_genes
        - extra_hvgs
        - pca
        - neighbors
        - umap

Minimal config example#

DATASETS:
  dataset_name:
    input:
      preprocessing: adata.h5ad
    preprocessing:
      raw_counts: X
      dask: true
      n_threads: 10
      highly_variable_genes:
        n_top_genes: 2000
        batch_key: batch
      extra_hvgs:
        union_over: [lineage]
        extra_genes: [CCR7, PTPRC]
      pca:
        n_comps: 50
      neighbors:
        n_neighbors: 15
      umap:
        min_dist: 0.5
      colors: [batch, cell_type, CCR7]
      plot_centroids: [cell_type]
      plot_gene_chunk_size: 12
      assemble:
        - normalize
        - highly_variable_genes
        - extra_hvgs
        - pca
        - neighbors
        - umap

Notes#

GPU-enabled rules use RAPIDS via the configured GPU environment when available.
Empty datasets are explicitly handled in scripts and written as valid empty zarr outputs.

Preprocessing

Contents

Preprocessing#

Inputs, outputs, and execution#

Global configuration#

Step details#

Normalize#

Filter genes#

Highly variable genes#

Extra HVGs#

PCA#

Neighbors#

UMAP#

Plots#

Assembly#

Minimal config example#

Notes#