Batch Integration

Batch Integration#

This module provides different scRNA integration methods for batch correction.

Configuration#

Here is an example configuration with all available parameters. Parameters that are optional can be ommitted in the config and the pipeline will use default values.

DATASETS:

  dataset_name:

    # input file configuration
    input:
      integration:
        # file_id to file path mapping
        pbmc: test/input/pbmc68k.h5ad

    # module configuration
    integration:
      raw_counts: layers/counts  # optional
      norm_counts: layers/normcounts  # optional

      label:
        - bulk_labels
        - louvain

      batch:
        - phase
        - batch_2

      output_types:
        - full
        - embed
        - knn

      var_mask:  # optional
        - highly_variable
        - highly_variable_2

      neighbors:  # optional
        n_neighbors: 30

      # random seed used by each integration method, default: 0
      seed: 0

      # number of CPU threads to use for preparation and running integration methods
      threads: 3

      methods: # mandatory, needs at least 1 value

        unintegrated: # no hyperparameters defined, will use defaults

        bbknn:
          # define hyperparamters if desired
          neighbors_within_batch: 3

        combat:
          covariates:
            - bulk_labels

        scanorama:
          batch_size: 100

        scvi:
          max_epochs: 100
          early_stopping: true
          categorical_covariate_keys:
            - []
            - [batch_2, phase]
          continuous_covariate_keys:
            - percent_mito

        scgen:
          n_epochs: 10

        scpoli:
          cell_type_keys:
            -
            - bulk_labels
          embedding_dims: 5
          recon_loss: nb
          n_epochs: 50
          early_stopping_kwargs:
            early_stopping_metric: val_prototype_loss
            mode: min
            threshold: 0
            patience: 20
            reduce_lr: true
            lr_patience: 13
            lr_factor: 0.1

        harmonypy:
          sigma: 0.1
          key: batch_2
          n_comps:
            - 10
            - 30

        harmony_pytorch:
          sigma: 0.1
          batch_key: batch_2
          n_comps: 30

Module specification#

Mandatory parameters that must be defined are the input file definition, label, batch and methods. All other paramters are optional. The AnnData file in h5ad or zarr requires the following input:

raw_counts: matrix under .X or .layers containing counts that have not been normalized or log-transformed, needed by e.g. scVI, scANVI, scPoli. Default will be .X
norm_counts: matrix under .X or .layers containing counts that have been normalized or log-transformed, needed by e.g. Scanorama, scGEN, unintegrated. Default will be .X
output_types: filters which methods and method modes to run. There are 3 different abstraction layers that an integration method can work on. Integration could either correct counts (full), a low-dimensional representation of the data (embed) or the kNN graph (knn). Some methods (e.g. Scanorama, unintegrated) provide multiple outputs, which can be utilised independently.
batch, label: columns in .obs that are used to correct the batch effect or inform the integration method with cell type information (label, only for semi-supervised methods like scANVI, scPoli)
neighbors: arguments that get passed to the sc.pp.neighbors function. The kNN is computed for full and embed outputs after integration
methods: methods configuration to determine which methods should be used, as well as which hyperparameters those methods should use
- hyperparamters correspond to the parameters that the integration method supports. For e.g. scvi-tools methods, you can define all parameters for the module setup and training functions at the same level
- The module also supports parameter exploration. If you pass a list to a hyperparamter, all combinations of that hyperparameter with the other parameters will be computed as a separate hyperparameter computation
var_mask: column in .var that you could

Output#

The output will be saved under <output_dir>/integration/. There are mappings of inputs (input_files.tsv) and outputs (output_files.tsv) as well as the mapping of hyperparameters to hex code used in the output files.

Metadata#

.uns['dataset'] name of task/dataset
.uns['methods'] methods applied to this object
.uns['integration'] integration specific entries
- .uns['integration']['method'] intgration method name
- .uns['integration']['label_key'] label used for integration
- .uns['integration']['batch_key'] batch used for integration
- .uns['integration']['output_type'] output type of method (one of ‘knn’, ‘embed’ or ‘full’)

Integrated output#

The direct method output is under {out_dir}/integration/dataset{dataset}/file_id~{file_id}/batch~{batch}/method~{method}--hyperparams~{hyperparams}--label~{label}/adata.zarr Different integration methods provide different types of output. Depending on the output type, the object must contain the following slots:

corrected graph (knn)
- .obsp['connectivities'], .obsp['distances'] integrated kNN graph returned by integration method
corrected embedding (embed)
- .obsm['X_emb'] integrated embedding returned by integration method
corrected features (full)
- .X corrected feature counts returned by integration method
- .obsm['X_pca'] PCA on corrected feature counts

Processed integrated output#

The integrated output files require further processing to be used in downstream analysis. Files with computed embeddings (for full feature outputs) and kNN graphs (for full feature and embedding outputs) are stored under {out_dir}/integration/dataset{dataset}/file_id~{file_id}/batch~{batch}/method~{method}--hyperparams~{hyperparams}--label~{label}--output_type~{output_type}.zarr.

Contributing to the module#

Adding new integration methods to the module simply requires adding a new script under scripts/integration/methods/ and adding meta information of the method in the params.tsv. The name of the script must match the method name in the params.tsv and the method must be defined in the params.tsv, otherwise it won’t be recognised by the pipeline. When developing the script, you can make use of the global variables and functions from the integration_utils.py script. This is particlarly useful if you want to manage different parameter assignments to e.g. train and model setup functions (see scripts/methods/scvi.py for an example). The best way to get started, is to look at one of the scripts (ideally one that has a similar model to what you want to implement).

Additionally, you might want to add a new conda environment YAML file, if the new integration requires different dependencies than what is already provided in envs. The new environment must then be specified in params.tsv and must be installed if you’re using the default running mode of the pipeline.

Once you have implemented the new integration method, you can test if the the method gets recognised via a dry run. See below about testing.

Testing#

Activate the snakemake environment and call test/run_test.sh with run specific Snakemake parameters.

conda activate snakemake
bash test/run_test.sh -n  # dry run
bash test/run_test.sh -c2  # actual run with max 2 cores

The script will call the pipeline and run a test script. If the input files don’t yet exist, the test script might throw a FileNotFoundError. You can ignore it, if you haven’t yet executed the pipeline completely and therefore don’t yet have all outputs.

Add new test scenarios#

You can configure a new test case in the test/config.yaml, which will be recognised in the bash command presented above.

Test model loading#

Some integration methods store pytorch model. Below are tests if these models can be loaded.

conda run --live-stream -n scarches python test/run_scarches_model_loading.py
conda run --live-stream -n scvi-tools python test/run_scvi-tools_model_loading.py

This requires the scarches and scvi-tools conda environments to be installed, which are defined under envs/.