Reference Mapping#
Reference mapping#
This module enables projection of new single-cell RNA-seq datasets onto pre-trained variational autoencoder (VAE) models or foundation models. This allows for:
Transfer learning from large reference datasets to smaller query datasets
Integration of new data with existing atlases
Consistent cell type annotation across studies
Leveraging pre-trained embeddings for downstream analysis
Environments#
The following environments are needed for reference mapping:
Supported Models#
The module currently supports:
scvi-tools models: Various variational inference models (scVI, scANVI, etc.) mapping with scArches
TODO: scArches models: VAE-based models (trVAE, scPoli, Expimap, etc.)
TODO: Foundation models: Large-scale pre-trained models
Configuration#
DATASETS:
test:
input:
reference_mapping:
query: test/input/pbmc68k.h5ad
reference_mapping:
scarches:
layer: X # or layer name like 'counts'
model: test/input/model
model_params:
batch_key: sample_id
labels_key: cell_type
categorical_covariate: [donor, condition]
continuous_covariate: [age]
train_kwargs:
max_epochs: 10
early_stopping: true
check_val_every_n_epoch: 1
Input#
The input AnnData object is the query dataset that should be mapped to the reference model.
The reference model should be defined as a Pytorch model directory under scarches > model.
Configuration Parameters#
layer(default:'X'): Which data layer to use from the query AnnData object'X'uses the main expression matrix (.X)'layers/counts'uses the counts layer from.layers['counts']Any other string uses the corresponding layer from
.layers[layer_name]
var_key(default:None): Column name in.varto use for gene matching between query and reference modelIf
None, uses the.varindex (var_names)Important for ensuring gene names are consistent between query and reference model
Model Parameters (model_params)#
Parameters that align the query data structure with the reference model’s expectations:
batch_key: Column name in.obscontaining batch/sample information (required)labels_key: Column name in.obscontaining cell type labels (optional, can beNonefor unlabeled data)categorical_covariate: List of categorical covariate column names in.obs(optional, e.g.,["donor", "condition"])continuous_covariate: List of continuous covariate column names in.obs(optional, e.g.,["age", "BMI"])
Training Parameters (train_params)#
Parameters that control the reference mapping training process (all optional):
max_epochs: Maximum number of training epochs (default: 100)early_stopping: Whether to use early stopping to prevent overfitting (default: true)check_val_every_n_epoch: How often to run validation during training (default: 5)
Note: The query data must have genes that overlap with the reference model’s training data. Gene matching is performed using the
var_keyparameter or.varindex. Only overlapping genes are used for mapping and missing genes from the query are padded with zeros.
Output#
The reference mapping workflow produces the following outputs:
<out_dir>/reference_mapping/dataset~<dataset>/file_id~<file_id>.zarr: Mapped AnnData object containing:Latent embedding (
obsm['X_emb']): Low-dimensional representations in the reference space
<out_dir>/reference_mapping/model/dataset~<dataset>/file_id~<file_id>/: Updated model directory with query data integrated, ready for further reference mapping tasks