Batch Analysis#
This module provides an exploratory framework for understanding the technical effects of batch variables at different hierarchical levels in single-cell datasets. It quantifies the strength of technical covariates by measuring the linear variance they explain using principal component regression (PCR). This systematic evaluation helps to assess the impact of each covariate, making it easier to decide which variable should be treated as the batch for subsequent batch correction or data integration steps.
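As a rough illustration of what a PCR score measures, the sketch below regresses each principal component on a categorical covariate and averages the resulting R² values, weighted by the variance of each component. It is a minimal, standard-library sketch under stated assumptions: PC coordinates and per-PC variances are taken as precomputed, the function names `r2_categorical` and `pcr_score` are hypothetical, and the module itself relies on the scib environment rather than on this code.

```python
from collections import defaultdict


def r2_categorical(values, labels):
    """R^2 of a linear model with one categorical predictor.

    For a one-hot encoded covariate this equals
    between-group sum of squares / total sum of squares.
    """
    mean = sum(values) / len(values)
    total_ss = sum((v - mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant PC has no variance to explain
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    between_ss = sum(
        len(group) * (sum(group) / len(group) - mean) ** 2
        for group in groups.values()
    )
    return between_ss / total_ss


def pcr_score(pc_coords, pc_variances, labels):
    """Variance-weighted mean of the per-PC R^2 values.

    pc_coords: one list per PC, each holding one coordinate per observation.
    pc_variances: variance of each PC, in the same order.
    """
    r2 = [r2_categorical(pc, labels) for pc in pc_coords]
    weighted = sum(r * v for r, v in zip(r2, pc_variances))
    return weighted / sum(pc_variances)


# Toy data: PC1 separates the two batches completely, PC2 not at all.
pc1 = [-1.0, -1.0, 1.0, 1.0]
pc2 = [-0.5, 0.5, -0.5, 0.5]
batches = ["a", "a", "b", "b"]
score = pcr_score([pc1, pc2], pc_variances=[3.0, 1.0], labels=batches)
print(score)  # 0.75
```

In this toy case the covariate explains all of PC1 and none of PC2, and PC1 carries three quarters of the total variance, so the score is 0.75.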
Features#
Principal component regression (PCR) analysis for quantifying linear effects of technical covariates on data
Theil’s U analysis for quantifying the association between categorical covariates and principal components
Pseudobulk generation and PCA plotting for sample-level exploration of batch structure
Optional preprocessing pipeline: normalization, gene filtering, PCA
Parallelized computation of batch PCR for scalability
Flexible configuration for covariates, permutations, and sample keys
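For readers unfamiliar with Theil's U, the following standard-library sketch computes it for two categorical label vectors as U(x|y) = (H(x) - H(x|y)) / H(x). The function names are hypothetical, and how the module maps this measure onto principal components is not shown here; the example only illustrates the definition and its asymmetry.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(x, y):
    """Conditional entropy H(X | Y) of two equally long label sequences."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (c / n) * math.log(c / y_counts[y_value])
        for (_, y_value), c in xy_counts.items()
    )


def theils_u(x, y):
    """Fraction of the uncertainty in x that is removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so it is trivially predictable
    return (h_x - conditional_entropy(x, y)) / h_x


# Toy labels: each sample belongs to exactly one donor.
sample = ["s1", "s2", "s3", "s4"]
donor = ["d1", "d1", "d2", "d2"]
u_donor_given_sample = theils_u(donor, sample)  # sample fully determines donor
u_sample_given_donor = theils_u(sample, donor)  # donor only narrows it down
print(u_donor_given_sample, u_sample_given_donor)
```

Unlike a correlation coefficient, the measure is asymmetric: knowing the sample determines the donor, but knowing the donor leaves half of the uncertainty about the sample.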
Environments#
The following conda environments are used by different steps:
scanpy (for preprocessing, PCA, plotting)
scib (for PC regression and batch PCR)
Configuration#
Configure the module under your dataset key using the batch_analysis section. Common keys:
sample: column(s) in .obs to use as the sample key (can be comma-separated for composite keys). It should represent the smallest common grouping of a technical effect and the covariate of interest. An error will be thrown if this is not the case.
covariates: list of covariate columns in .obs to test for batch effects. These covariates are also used to color the pseudobulk PCA plots and to generate the Theil’s U heatmap.
permute_covariates: (optional) list of covariates to permute for computing a z-score. If not specified, all covariates will be used for permutations. The covariates are permuted per sample to compute empirical null distributions.
n_permutations: number of permutations for each covariate
Step-specific overrides (e.g., normalize, highly_variable_genes, pca) for preprocessing
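One way to read the requirement on the sample key is that every (composite) sample must carry exactly one value of each covariate. The sketch below checks that reading on plain Python dictionaries standing in for rows of .obs. It is an assumption about what the check means, not the module's actual validation code; `split_sample_key` and `check_sample_key` are hypothetical names, and the row values are toy data.

```python
def split_sample_key(sample):
    """Split a comma-separated `sample` setting into .obs column names."""
    return [column.strip() for column in sample.split(",") if column.strip()]


def check_sample_key(obs_rows, sample, covariates):
    """Raise ValueError if one sample carries several values of a covariate.

    obs_rows: list of dicts, one per cell, standing in for rows of .obs.
    Returns the list of columns that make up the sample key.
    """
    columns = split_sample_key(sample)
    seen = {}  # (sample key, covariate) -> first value observed
    for row in obs_rows:
        key = tuple(row[column] for column in columns)
        for covariate in covariates:
            first = seen.setdefault((key, covariate), row[covariate])
            if first != row[covariate]:
                raise ValueError(
                    f"sample {key} has multiple values for {covariate!r}: "
                    f"{first!r} and {row[covariate]!r}"
                )
    return columns


# Toy rows: every (batch, bulk_labels) pair has a single is_cd14_mono value.
obs = [
    {"batch": "0", "bulk_labels": "CD14+ Monocyte", "is_cd14_mono": True},
    {"batch": "0", "bulk_labels": "CD14+ Monocyte", "is_cd14_mono": True},
    {"batch": "1", "bulk_labels": "CD19+ B", "is_cd14_mono": False},
]
columns = check_sample_key(obs, "batch, bulk_labels", ["is_cd14_mono"])
print(columns)  # ['batch', 'bulk_labels']
```

Whitespace around the commas is stripped, so `batch, bulk_labels` and `batch,bulk_labels` name the same composite key in this sketch.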
Note: Preprocessing steps such as normalization, highly variable gene selection, and PCA are optional. Each step is only executed if its corresponding key (normalize, highly_variable_genes, pca) is defined in your configuration. If a key is omitted, that step is skipped for the dataset.
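A practical detail when checking whether a step is "defined": a YAML key written without a value (for example `pca:`) is loaded as `None`, so presence has to be tested with `in` rather than by truthiness. The sketch below illustrates this on plain dictionaries that mimic what a YAML loader would return. `defined_steps` is a hypothetical helper, not a function of the workflow, and which further steps the workflow runs once a key is present is decided by its own rules.

```python
# Preprocessing keys named in this documentation, in pipeline order.
PREPROCESSING_KEYS = ("normalize", "highly_variable_genes", "pca")


def defined_steps(module_config):
    """Return the preprocessing keys present in the config, in pipeline order.

    Presence is tested with `in`, because a key without a value is None
    and would be missed by a truthiness check such as `config.get(key)`.
    """
    return [key for key in PREPROCESSING_KEYS if key in module_config]


# A config with empty preprocessing keys, as a YAML loader would return it.
unprocessed = {
    "sample": "batch, bulk_labels",
    "covariates": ["bulk_labels", "batch", "is_cd14_mono"],
    "normalize": None,
    "highly_variable_genes": None,
    "pca": None,
}
# A config for data that already contains PCA information.
preprocessed = {"sample": "sample,donor", "covariates": ["assay", "sex"]}

steps_unprocessed = defined_steps(unprocessed)
steps_preprocessed = defined_steps(preprocessed)
print(steps_unprocessed)   # ['normalize', 'highly_variable_genes', 'pca']
print(steps_preprocessed)  # []
```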
Example configuration#
DATASETS:
  BATCH_ANALYSIS_PREPROCESSED:
    input:
      batch_analysis:
        file_1: test/input/blood_pca.zarr
    batch_analysis:
      covariates:
        - sample
        - donor
        - assay
        - sex
        - disease
        - self_reported_ethnicity
      permute_covariates:
        - assay
        - sex
        - disease
        - self_reported_ethnicity
      n_permutations: 1000
      sample: sample,donor
  BATCH_ANALYSIS_UNPROCESSED:
    input:
      batch_analysis:
        file_1: test/input/pbmc68k.h5ad
    batch_analysis:
      sample: batch, bulk_labels
      covariates:
        - bulk_labels
        - batch
        - is_cd14_mono
      normalize:
      highly_variable_genes:
      pca:
The example configuration above demonstrates how to set up the batch_analysis module for two datasets:
BATCH_ANALYSIS_PREPROCESSED: Uses a dataset with precomputed PCA in the input AnnData object. It specifies multiple covariates (sample, donor, assay, sex, disease, self_reported_ethnicity) to test for batch effects, and defines which covariates to permute for significance testing. The number of permutations is set to 1000, and a composite sample key (sample,donor) is used.
BATCH_ANALYSIS_UNPROCESSED: Uses a dataset without the necessary PCA information and configures the workflow to perform normalization, highly variable gene selection, and PCA as preprocessing steps. It sets batch and bulk_labels as the composite sample key, and tests the covariates bulk_labels, batch, and is_cd14_mono for batch effects.
Workflow steps#
The batch_analysis workflow consists of the following steps:
flowchart TD
A[Preprocessing] --> B[Prepare data]
B --> C[Theil's U]
B --> D[Pseudobulk PCA plot]
B --> E{determine covariates}
E --> F[PC regression]
F --> G[Plots]
Preprocessing (optional)
Steps: normalize, filter genes, HVG selection, PCA
Each preprocessing step is optional and will only be executed if its corresponding key is defined in the configuration. For example, if normalize is defined, normalization and all downstream steps will be performed until PCA; if pca is defined, only PCA will be computed. This allows users to skip preprocessing if their input data is already preprocessed and contains the necessary PCA information for batch PCR analysis.
Uses rules from the preprocessing module, with dataset-specific overrides.
Prepare data:
Sets the sample key for pseudobulk aggregation and PCR analysis based on the configured sample key. The sample key represents the smallest common grouping of a technical effect and the covariate of interest (e.g., batch or donor).
Aggregates cells into pseudobulk samples using the configured sample key.
Recomputes PCA on the pseudobulk data and colors the PCA plots by the configured covariates.
Covariate setup (determine_covariates):
Determines which covariates to test and sets up permutation schemes.
Batch PC regression (batch_pcr):
Runs principal component regression for each covariate and permutation, and computes z-scores.
Collect:
Aggregates per-covariate results into a single table.
Plot:
Generates barplots and violin plots summarizing PCR and permutation results.
Generates Theil’s U plots to visualize the association between covariates and principal components.
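The pseudobulk aggregation in the "Prepare data" step can be pictured as collapsing all cells that share a sample key into a single profile. The sketch below sums per-cell count vectors within each sample. Summation is an assumption (the aggregation function is not stated in this documentation), `pseudobulk_sum` is a hypothetical name, and plain lists stand in for the AnnData count matrix.

```python
from collections import defaultdict


def pseudobulk_sum(counts, sample_labels):
    """Sum per-cell count vectors within each sample.

    counts: one list of gene counts per cell.
    sample_labels: the sample key of each cell, in the same order.
    Returns a dict mapping each sample to its summed count vector.
    """
    n_genes = len(counts[0])
    bulks = defaultdict(lambda: [0] * n_genes)
    for cell_counts, sample in zip(counts, sample_labels):
        bulk = bulks[sample]
        for gene_index, value in enumerate(cell_counts):
            bulk[gene_index] += value
    return dict(bulks)


# Toy data: four cells, three genes, two samples.
counts = [[1, 0, 3], [2, 1, 0], [0, 5, 1], [4, 0, 0]]
samples = ["s1", "s1", "s2", "s2"]
bulks = pseudobulk_sum(counts, samples)
print(bulks)  # {'s1': [3, 1, 3], 's2': [4, 5, 1]}
```

After aggregation there is one observation per sample, which is why PCA is recomputed on the pseudobulk data before plotting.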
Output#
Pseudobulk#
Pseudobulk AnnData/Zarr output:
<output_dir>/batch_analysis/prepare/dataset~<dataset>/file_id~<file_id>/pseudobulks.zarr
Pseudobulk PCA plots:
<images>/batch_analysis/dataset~<dataset>/file_id~<file_id>/pca_plots
Principal component regression analysis#
Per-covariate PCR results:
<output_dir>/batch_analysis/dataset~<dataset>/file_id~<file_id>/batch_pcr/{covariate}.tsv
Aggregated results:
<output_dir>/batch_analysis/dataset~<dataset>/file_id~<file_id>/batch_pcr.tsv
Plots:
<images>/batch_analysis/dataset~<dataset>/file_id~<file_id>/batch_pcr_bar.png, <images>/<dataset>/batch_pcr_violin.png
Each result file contains columns such as:
covariate: tested covariate
pcr: PCR score
permuted: whether the score is from a permutation
n_covariates: number of unique values in the covariate
z_score: z-score of observed PCR vs. permutations
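To make the relationship between the pcr, permuted, and z_score columns concrete, the sketch below shuffles covariate values across samples to build a null distribution and then standardizes an observed score against it. It is a minimal sketch under stated assumptions: the function names are hypothetical, the use of the sample standard deviation is an assumption, and the scores are made-up toy numbers rather than output of the workflow.

```python
import random
import statistics


def permute_per_sample(sample_to_value, rng):
    """Shuffle covariate values across samples.

    All cells of a sample keep a shared value, so the sample structure
    is preserved while the sample-covariate link is broken.
    """
    samples = list(sample_to_value)
    values = [sample_to_value[s] for s in samples]
    rng.shuffle(values)
    return dict(zip(samples, values))


def permutation_z_score(observed, permuted_scores):
    """Standardize the observed score against the permutation scores."""
    mean = statistics.mean(permuted_scores)
    sd = statistics.stdev(permuted_scores)  # sample SD, an assumption
    if sd == 0:
        return float("nan")  # degenerate null distribution
    return (observed - mean) / sd


# Toy mapping from samples to a covariate, and one permutation of it.
assay_by_sample = {"s1": "10x", "s2": "10x", "s3": "smart-seq", "s4": "smart-seq"}
shuffled = permute_per_sample(assay_by_sample, random.Random(0))

# Made-up PCR scores: one observed value and five permutation values.
observed_pcr = 0.42
permuted_pcr = [0.10, 0.12, 0.08, 0.11, 0.09]
z = permutation_z_score(observed_pcr, permuted_pcr)
print(round(z, 2))
```

A large positive z-score indicates that the covariate explains far more variance than expected when its values are assigned to samples at random.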
Theil’s U#
Results:
<images>/batch_analysis/dataset~<dataset>/file_id~<file_id>/theils_u.tsv
Plots:
<images>/batch_analysis/dataset~<dataset>/file_id~<file_id>/theils_u_heatmap.png
Testing#
Activate the snakemake environment and run the test workflow:
conda activate snakemake
bash test/run_test.sh -n
This test requires precomputed objects from the preprocessing tests.
See the preprocessing module for more details.