Merge#
Merge files#
Merge multiple AnnData files into a single consolidated dataset by concatenating observations (cells) across files. This module handles datasets with different gene sets and metadata schemas, and supports memory-efficient processing for large datasets.
Configuration#
DATASETS:
  Lee2020:
    input:
      merge:
        file_1: test/input/load_data/harmonize_metadata/Lee2020.zarr
        file_2: test/input/load_data/harmonize_metadata/SchulteSchrepping2020.zarr
    merge:
      merge_strategy: inner
      keep_all_columns: true
      allow_duplicate_obs: true
      allow_duplicate_vars: false
      new_indices: false
      persist: true
      threads: 5
      stride: 500_000
      dask: true
      backed: true
      slots:
        X: X
        obs: obs
        var: var
        layers: layers
Configuration Options#
merge_strategy: How to handle overlapping genes/variables across datasets
  "inner": Keep only genes present in all input datasets (intersection)
  "outer": Keep all genes from all datasets, filling missing values with zeros (union)
keep_all_columns: How to handle observation metadata columns
  true: Keep all obs columns from all datasets, filling missing values with NaN
  false: Only keep obs columns that are present in all datasets
allow_duplicate_obs: How to handle duplicate observation names during merging
  true: Allow duplicate cell barcodes/names in the final dataset
  false: After merging, remove duplicate observations by keeping only the first occurrence of each duplicate name. Note: Duplicates between different input files are silently dropped, which can result in data loss if files have overlapping cell names.
allow_duplicate_vars: Whether duplicate variable names are allowed in the final merged dataset
  true: Allow duplicate gene names
  false: Raise an error if duplicate gene names are found
new_indices: Whether to generate new sequential cell identifiers
  true: Create new cell IDs in the format {dataset}-{index}, where dataset comes from the wildcard and index is sequential (0, 1, 2, …)
  false: Preserve original cell names from input files (default)
threads: Number of threads for parallel processing (used with Dask)
stride: Cell batch size for processing large datasets
  Controls how many cells are processed/merged in a single chunk.
  In Dask mode with persist: true, this value is used as the batch size for each persisted merge step (up to stride cells per batch).
  Smaller values reduce peak memory usage at the cost of more, smaller batches and higher scheduling overhead; larger values improve throughput but increase peak RAM usage.
dask: Enable Dask arrays for distributed/out-of-core processing
  true: Use Dask for memory-efficient processing of very large datasets
  false: Use standard in-memory processing
backed: Enable backed mode using AnnCollection for efficient merging
  true: Keep data on disk during the merging process
  false: Load all data into memory
persist: Persist intermediate Dask arrays during merge to keep task graphs shallow
  true: In Dask mode with more than 2 inputs, merges files in cell-count batches of up to stride cells and persists each intermediate result.
  false: Uses direct concatenation without intermediate persistence (default).
  Persisting materializes intermediate Dask arrays in worker memory; this can significantly increase peak RAM usage compared to non-persisted execution, especially for large stride values or very large datasets.
  Recommended for performance when enough memory is available; consider disabling it or reducing stride on memory-constrained systems.
  Only relevant when dask: true
slots: Specify which data slots to read from zarr files
  Dictionary mapping slot names to zarr group names
  Only applies to zarr input files
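The effect of merge_strategy can be illustrated without AnnData at all. The following is a minimal sketch, assuming each dataset is a plain (genes, rows) tuple; align_genes is a hypothetical helper written for this example, not a function of the module, and the gene names and counts are invented. The gene order it produces (first-seen order) is also an assumption and may differ from what the actual merge returns.

```python
def align_genes(datasets, merge_strategy="inner"):
    """Concatenate cells from several datasets onto a shared gene list.

    Each dataset is a tuple (genes, rows): `genes` is a list of gene names,
    `rows` is a list of per-cell count lists in the same order as `genes`.
    """
    gene_lists = [genes for genes, _ in datasets]
    if merge_strategy == "inner":
        # Intersection: genes present in every dataset, in first-dataset order
        shared = set(gene_lists[0]).intersection(*gene_lists[1:])
        merged_genes = [g for g in gene_lists[0] if g in shared]
    elif merge_strategy == "outer":
        # Union: every gene seen in any dataset, in first-seen order
        merged_genes = list(dict.fromkeys(g for genes in gene_lists for g in genes))
    else:
        raise ValueError(f"Unknown merge_strategy: {merge_strategy!r}")

    merged_rows = []
    for genes, rows in datasets:
        position = {g: i for i, g in enumerate(genes)}
        for row in rows:
            # Genes missing from this dataset are filled with zeros
            merged_rows.append(
                [row[position[g]] if g in position else 0 for g in merged_genes]
            )
    return merged_genes, merged_rows


# Invented toy data: two cells in the first dataset, one in the second
a = (["CD3E", "CD4", "MS4A1"], [[1, 2, 0], [0, 5, 3]])
b = (["CD4", "MS4A1", "NKG7"], [[7, 0, 4]])

inner_genes, inner_rows = align_genes([a, b], "inner")
outer_genes, outer_rows = align_genes([a, b], "outer")
print(inner_genes, inner_rows)
print(outer_genes, outer_rows)
```

With "inner" only CD4 and MS4A1 survive; with "outer" all four genes are kept and the gaps become zeros, which is why "outer" can noticeably increase the size of the merged matrix.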
Processing Modes#
Standard Mode (default)#
Loads files sequentially into memory
Uses scanpy.concat() for merging
Best performance for datasets that fit in memory
Dask Mode (dask: true)#
Uses Dask arrays for out-of-core processing
Memory-efficient for very large datasets
Supports parallel processing across chunks
With persist: true and more than 2 input files, intermediate merges are persisted to reduce graph depth and scheduler overhead
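How stride interacts with persist: true may be easier to see with a small sketch. The grouping rule below (greedy, consecutive files, at most stride cells per batch, an oversized file forming its own batch) is an assumption made for illustration; the module's actual batching logic may differ. batch_by_stride is a hypothetical helper and the cell counts are invented.

```python
def batch_by_stride(cell_counts, stride):
    """Group files into consecutive batches holding at most `stride` cells.

    cell_counts: dict mapping file id -> number of cells (insertion order kept).
    A single file larger than `stride` still forms its own batch.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    batches, current, current_cells = [], [], 0
    for file_id, n_cells in cell_counts.items():
        # Close the current batch if adding this file would exceed the stride
        if current and current_cells + n_cells > stride:
            batches.append(current)
            current, current_cells = [], 0
        current.append(file_id)
        current_cells += n_cells
    if current:
        batches.append(current)
    return batches


# Invented cell counts for five hypothetical input files
counts = {
    "file_1": 200_000,
    "file_2": 250_000,
    "file_3": 300_000,
    "file_4": 600_000,
    "file_5": 50_000,
}
batches = batch_by_stride(counts, stride=500_000)
print(batches)
```

Under this assumed rule, each batch would be merged and persisted before the next one is started, so a smaller stride means more, smaller persisted steps and a lower peak memory footprint per step.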
Backed Mode (backed: true)#
Uses AnnCollection to keep data on disk
Good balance between memory efficiency and performance
Suitable for moderately large datasets
Behavior#
File Processing#
Single File: Creates a symbolic link instead of merging if there is only one input file
Multiple Files: Concatenates all files along the observation (cell) axis
Gene Alignment: Aligns gene sets according to merge_strategy
Metadata Merging: Combines observation metadata according to keep_all_columns
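The keep_all_columns behavior for observation metadata can be sketched in the same spirit. This is an illustration only, assuming each obs table is a non-empty list of row dicts with identical keys; merge_obs is a hypothetical helper and the column names and values are invented.

```python
import math


def merge_obs(tables, keep_all_columns=True):
    """Concatenate obs tables given as non-empty lists of row dicts."""
    column_lists = [list(table[0].keys()) for table in tables]
    if keep_all_columns:
        # Union of columns, in first-seen order
        columns = list(dict.fromkeys(c for cols in column_lists for c in cols))
    else:
        # Only columns present in every table, in first-table order
        shared = set(column_lists[0]).intersection(*column_lists[1:])
        columns = [c for c in column_lists[0] if c in shared]
    # Columns missing from a table are filled with NaN
    return [
        {c: row.get(c, math.nan) for c in columns}
        for table in tables
        for row in table
    ]


# Invented metadata for one cell from each dataset
lee = [{"donor": "D1", "disease": "COVID-19"}]
schulte = [{"donor": "P7", "age": 54}]

all_cols = merge_obs([lee, schulte], keep_all_columns=True)
shared_cols = merge_obs([lee, schulte], keep_all_columns=False)
print(all_cols)
print(shared_cols)
```

With keep_all_columns: true the merged table carries donor, disease and age, with NaN where a dataset lacks the column; with false only donor remains.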
Index and Metadata Handling#
Cell Identifiers:
  When new_indices: true, generates new cell IDs in the format {dataset}-{sequential_index}, where dataset is from the wildcard and sequential_index is a running index (0, 1, 2, …). Original cell names are preserved in the obs_names_before_{dataset} column
  When new_indices: false, preserves original cell IDs from input files
Dataset Tracking: Adds a file_id column to obs and stores dataset info in uns['merge']
Duplicate Checking:
  If allow_duplicate_obs: false, duplicate observations are automatically removed (keeping first occurrence)
  If allow_duplicate_vars: false, duplicate variables cause an error and merging is halted
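The index and duplicate rules above can be sketched on plain lists of names. The three helpers are hypothetical and only mirror the documented behavior (keep the first occurrence, raise on duplicate variables, build {dataset}-{index} identifiers); the barcodes are invented and the error message is an assumption.

```python
from collections import Counter


def drop_duplicate_obs(obs_names):
    """Return the positions to keep: the first occurrence of each name."""
    seen = set()
    keep = []
    for i, name in enumerate(obs_names):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    return keep


def check_duplicate_vars(var_names):
    """Raise if any variable (gene) name occurs more than once."""
    duplicated = sorted(n for n, c in Counter(var_names).items() if c > 1)
    if duplicated:
        raise ValueError(f"Duplicate variable names: {duplicated}")


def new_obs_names(dataset, n_obs):
    """Build sequential cell IDs in the {dataset}-{index} format."""
    return [f"{dataset}-{i}" for i in range(n_obs)]


# Invented barcodes; the third repeats the first, e.g. from a second file
obs_names = ["AAAC-1", "AAAG-1", "AAAC-1", "TTTG-1"]
kept = drop_duplicate_obs(obs_names)
renamed = new_obs_names("Lee2020", len(obs_names))
print(kept)
print(renamed)
```

Note that dropping duplicates discards the third cell even though it may come from a different file, whereas generating new indices keeps all four cells under unique names.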
Memory Optimization#
Automatic garbage collection between files
Slot removal to free memory during processing
Sparse matrix format preservation
Chunked processing for large datasets
Optional intermediate persistence for Dask merges (persist: true)
Output#
A single zarr file containing the merged AnnData object with:
All observations (cells) from input files concatenated
Aligned gene sets based on merge strategy
Combined metadata with file-of-origin tracking
Preserved data types and sparse formats where possible
Use Cases#
Multi-sample studies: Combine data from different experimental conditions
Cross-dataset integration: Merge datasets from different studies or technologies
Batch processing: Consolidate results from parallel processing pipelines
Data warehousing: Create unified datasets for downstream analysis