Merge#

Merge files#

Merge multiple AnnData files into a single consolidated dataset by concatenating observations (cells) across files. This module handles datasets with different gene sets and metadata schemas, and supports memory-efficient processing for large datasets.

Configuration#

DATASETS:
  Lee2020:
    input:
      merge:
        file_1: test/input/load_data/harmonize_metadata/Lee2020.zarr
        file_2: test/input/load_data/harmonize_metadata/SchulteSchrepping2020.zarr
    merge:
      merge_strategy: inner
      keep_all_columns: true
      allow_duplicate_obs: true
      allow_duplicate_vars: false
      new_indices: false
      persist: true
      threads: 5
      stride: 500_000
      dask: true
      backed: true
      slots:
        X: X
        obs: obs
        var: var
        layers: layers

Configuration Options#

merge_strategy: How to handle overlapping genes/variables across datasets
- "inner": Keep only genes present in all input datasets (intersection)
- "outer": Keep all genes from all datasets, filling missing values with zeros (union)
keep_all_columns: How to handle observation metadata columns
- true: Keep all obs columns from all datasets, filling missing values with NaN
- false: Only keep obs columns that are present in all datasets
allow_duplicate_obs: How to handle duplicate observation names during merging
- true: Allow duplicate cell barcodes/names in the final dataset
- false: After merging, remove duplicate observations by keeping only the first occurrence of each duplicate name. Note: Duplicates between different input files will be silently dropped, which can result in data loss if files have overlapping cell names.
allow_duplicate_vars: Whether duplicate variable names are allowed in the final merged dataset
- true: Allow duplicate gene names
- false: Raise error if duplicate gene names are found
new_indices: Whether to generate new sequential cell identifiers
- true: Create new cell IDs in the format {dataset}-{index}, where dataset comes from the wildcard and index is sequential (0, 1, 2, …)
- false: Preserve original cell names from input files (default)
threads: Number of threads for parallel processing (used with Dask)
stride: Cell batch size for processing large datasets
- Controls how many cells are processed/merged in a single chunk.
- In Dask mode with persist: true, this value is used as the batch size for each persisted merge step (up to stride cells per batch).
- Smaller values reduce peak memory usage at the cost of more, smaller batches and higher scheduling overhead; larger values improve throughput but increase peak RAM usage.
dask: Enable Dask arrays for distributed/out-of-core processing
- true: Use Dask for memory-efficient processing of very large datasets
- false: Use standard in-memory processing
backed: Enable backed mode using AnnCollection for efficient merging
- true: Keep data on disk during merging process
- false: Load all data into memory
persist: Persist intermediate Dask arrays during merge to keep task graphs shallow
- true: In Dask mode with more than 2 inputs, merges files in cell-count batches of up to stride cells and persists each intermediate result.
- false: Uses direct concatenation without intermediate persistence (default).
- Persisting materializes intermediate Dask arrays in worker memory; this can significantly increase peak RAM usage compared to non-persisted execution, especially for large stride values or very large datasets.
- Recommended for performance when enough memory is available; consider disabling or reducing stride on memory-constrained systems.
- Only relevant when dask: true
slots: Specify which data slots to read from zarr files
- Dictionary mapping slot names to zarr group names
- Only applies to zarr input files
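
The constraints listed above can be checked before a merge is started. The following is a minimal sketch, not the module's own validation: the function name `validate_merge_config` and the exact checks are assumptions derived only from the option descriptions in this section.

```python
# Hypothetical helper: sanity-check a `merge` config block before running.
# The rules below are inferred from the option descriptions, not from the module's code.
ALLOWED_STRATEGIES = {"inner", "outer"}
BOOLEAN_KEYS = (
    "keep_all_columns", "allow_duplicate_obs", "allow_duplicate_vars",
    "new_indices", "persist", "dask", "backed",
)
POSITIVE_INT_KEYS = ("threads", "stride")


def validate_merge_config(cfg):
    """Return a list of problems found in `cfg`; an empty list means none were found."""
    problems = []

    strategy = cfg.get("merge_strategy")
    if strategy is not None and strategy not in ALLOWED_STRATEGIES:
        problems.append(f"merge_strategy must be 'inner' or 'outer', got {strategy!r}")

    for key in BOOLEAN_KEYS:
        if key in cfg and not isinstance(cfg[key], bool):
            problems.append(f"{key} must be true or false, got {cfg[key]!r}")

    for key in POSITIVE_INT_KEYS:
        value = cfg.get(key)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")

    # `persist` is documented as only relevant in Dask mode.
    if cfg.get("persist") and not cfg.get("dask"):
        problems.append("persist is only relevant when dask is true")

    return problems


example = {
    "merge_strategy": "inner",
    "keep_all_columns": True,
    "persist": True,
    "dask": True,
    "threads": 5,
    "stride": 500_000,
}
print(validate_merge_config(example))
```

Returning a list of problems rather than raising on the first one makes it possible to report every misconfigured option at once.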

Processing Modes#

Standard Mode (default)#

Loads files sequentially into memory
Uses scanpy.concat() for merging
Best performance for datasets that fit in memory

Dask Mode (`dask: true`)#

Uses Dask arrays for out-of-core processing
Memory-efficient for very large datasets
Supports parallel processing across chunks
With persist: true and more than 2 input files, intermediate merges are persisted to reduce graph depth and scheduler overhead
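
To illustrate how stride bounds the size of each persisted merge step, the sketch below groups input files into batches of at most stride cells. It is an assumption about the grouping logic, not the module's implementation: the function `batch_by_cell_count`, the greedy left-to-right grouping, and the rule that an oversized file forms its own batch are all hypothetical.

```python
def batch_by_cell_count(cell_counts, stride):
    """Greedily group consecutive files so each batch holds at most `stride` cells.

    `cell_counts` maps a file label to its number of cells.
    Assumption: a single file with more than `stride` cells becomes its own batch.
    """
    if stride <= 0:
        raise ValueError("stride must be a positive integer")

    batches = []
    current, current_cells = [], 0
    for name, n_cells in cell_counts.items():
        # Close the current batch if adding this file would exceed the limit.
        if current and current_cells + n_cells > stride:
            batches.append(current)
            current, current_cells = [], 0
        current.append(name)
        current_cells += n_cells
    if current:
        batches.append(current)
    return batches


# Hypothetical cell counts for four input files.
counts = {"file_1": 200_000, "file_2": 250_000, "file_3": 300_000, "file_4": 600_000}
print(batch_by_cell_count(counts, stride=500_000))
# [['file_1', 'file_2'], ['file_3'], ['file_4']]
```

With a smaller stride, the same inputs produce more batches, which matches the trade-off described for stride: lower peak memory per step, but more intermediate results to schedule and persist.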

Backed Mode (`backed: true`)#

Uses AnnCollection to keep data on disk
Good balance between memory efficiency and performance
Suitable for moderately large datasets

Behavior#

File Processing#

Single File: If there is only one input file, creates a symbolic link to it instead of merging
Multiple Files: Concatenates all files along the observation (cell) axis
Gene Alignment: Aligns gene sets according to merge_strategy
Metadata Merging: Combines observation metadata according to keep_all_columns
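
The effect of merge_strategy on genes and of keep_all_columns on obs columns can be pictured as an intersection or union of names. The sketch below uses plain Python lists; `align_names` and `merged_obs_columns` are hypothetical helpers, and the first-seen ordering is an assumption that may differ from the order in the actual output.

```python
def align_names(name_lists, how):
    """Return the intersection ("inner") or union ("outer") of several name lists.

    Names are returned in first-seen order (an assumption made for readability).
    """
    if how not in ("inner", "outer"):
        raise ValueError(f"how must be 'inner' or 'outer', got {how!r}")
    if not name_lists:
        return []

    # dict keys preserve insertion order, giving an ordered union.
    seen = {}
    for names in name_lists:
        for name in names:
            seen.setdefault(name, None)
    ordered_union = list(seen)

    if how == "outer":
        return ordered_union
    common = set(name_lists[0]).intersection(*name_lists[1:])
    return [name for name in ordered_union if name in common]


def merged_obs_columns(column_lists, keep_all_columns):
    """keep_all_columns: true behaves like a union, false like an intersection."""
    return align_names(column_lists, "outer" if keep_all_columns else "inner")


genes = [["CD3E", "CD19", "MS4A1"], ["CD19", "CD3E", "LYZ"]]
print(align_names(genes, "inner"))  # ['CD3E', 'CD19']
print(align_names(genes, "outer"))  # ['CD3E', 'CD19', 'MS4A1', 'LYZ']

obs_columns = [["donor", "sex"], ["donor", "age"]]
print(merged_obs_columns(obs_columns, keep_all_columns=False))  # ['donor']
```

In the outer case, the entries that a dataset does not provide still have to be filled in, which is where the zero fill for genes and the NaN fill for obs columns described in Configuration Options apply.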

Index and Metadata Handling#

Cell Identifiers:
- When new_indices: true, generates new cell IDs in the format {dataset}-{sequential_index}, where dataset is taken from the wildcard and sequential_index is a running index (0, 1, 2, …). Original cell names are preserved in the obs_names_before_{dataset} column.
- When new_indices: false, preserves original cell IDs from input files
Dataset Tracking: Adds file_id column to obs and stores dataset info in uns['merge']
Duplicate Checking:
- If allow_duplicate_obs: false, duplicate observations are removed automatically after merging (keeping the first occurrence)
- If allow_duplicate_vars: false, duplicate variables raise an error and merging is halted
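
The index and duplicate rules above can be sketched with plain Python lists. This is an illustration of the described behavior, not the module's code: the helper names `make_new_indices`, `first_occurrence_positions`, and `check_duplicate_vars` are hypothetical.

```python
from collections import Counter


def make_new_indices(dataset, n_obs):
    """Build sequential cell IDs in the {dataset}-{index} format."""
    return [f"{dataset}-{i}" for i in range(n_obs)]


def first_occurrence_positions(obs_names):
    """Return the positions to keep so that each name appears once (first occurrence wins)."""
    seen = set()
    keep = []
    for position, name in enumerate(obs_names):
        if name not in seen:
            seen.add(name)
            keep.append(position)
    return keep


def check_duplicate_vars(var_names):
    """Raise ValueError if any variable name occurs more than once."""
    duplicates = sorted(name for name, count in Counter(var_names).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate variable names: {duplicates}")


print(make_new_indices("Lee2020", 3))
# ['Lee2020-0', 'Lee2020-1', 'Lee2020-2']

# The third barcode repeats the first, for example because it comes from a second file.
print(first_occurrence_positions(["AAAC-1", "TTTG-1", "AAAC-1"]))
# [0, 1]
```

The second call shows why allow_duplicate_obs: false can lose data: a cell from a later file that shares a barcode with an earlier one is dropped without an error. Setting new_indices: true is one way to avoid such collisions, since every generated ID is unique within the merged dataset.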

Memory Optimization#

Automatic garbage collection between files
Slot removal to free memory during processing
Sparse matrix format preservation
Chunked processing for large datasets
Optional intermediate persistence for Dask merges (persist: true)

Output#

A single zarr file containing the merged AnnData object with:

All observations (cells) from input files concatenated
Aligned gene sets based on merge strategy
Combined metadata with file-of-origin tracking
Preserved data types and sparse formats where possible

Use Cases#

Multi-sample studies: Combine data from different experimental conditions
Cross-dataset integration: Merge datasets from different studies or technologies
Batch processing: Consolidate results from parallel processing pipelines
Data warehousing: Create unified datasets for downstream analysis