Collect#
Collect files#
Collect and merge slots from multiple files into a single AnnData file. This module handles cases where data is split across multiple files and needs to be consolidated, with support for merging metadata and preserving file-specific information.
Parameters#
DATASETS:
SchulteSchrepping2020:
input:
collect:
file_1: test/input/load_data/download/SchulteSchrepping2020.h5ad
file_2: test/input/load_data/harmonize_metadata/SchulteSchrepping2020_local.zarr
collect:
same_slots:
- X
- var
merge_slots:
- obs
- obsm
obs_index_col:
file_2: barcode
sep: "--"
Configuration Options#
same_slots: List of AnnData slots that are identical across all input files. These slots are copied directly from the first file without modification. Common slots includeX(expression matrix) andvar(gene metadata).merge_slots: List of AnnData slots that need to be merged from all input files:For DataFrame slots (like
obs): Columns are merged, with file-specific columns getting suffixed with the separator and file ID (e.g.,cell_type--file_1)For dictionary slots (like
obsm,obsp): Keys are merged, with file-specific keys getting suffixed with the separator and file ID
sep: Separator string used to distinguish file-specific columns/keys (default:"--"). For example, ifsep="--"and a column exists in file_1, it becomescolumn_name--file_1.obs_index_col: Specifies which column to use as the obs index for each file:Can be a string (same column for all files)
Can be a dict mapping file IDs to column names (as shown above)
Can use regex patterns as keys to match multiple file IDs
skip_slots: List of slots to exclude from processing, even if listed insame_slotsormerge_slots.
Behavior#
Single File#
If only one input file is provided, itโs written directly to the output without modification.
Multiple Files#
Index Alignment: All files are aligned to have matching
obs_namesandvar_namesin the same order as the first fileLarge Slot Handling: For memory efficiency, large slots (
layers,obsm,obsp) are reloaded from disk when reordering is neededObs Index Setting: If
obs_index_colis specified, the designated column becomes the new obs indexColumn Merging: For
obsmerging, columns that differ across files are renamed with file suffixes, while identical columns are preserved as-isData Linking: For
same_slots, data is linked/copied from the first file to avoid duplication
Output#
A single zarr file containing the merged AnnData object with:
Aligned indices across all input files
File-specific metadata preserved with appropriate suffixes
Shared data slots copied from the reference file
Merged slots combining information from all input files