Data loading#

../_images/load_data.png

Workflow#

Given a TSV file and a schema mapping, the pipeline does the following:

  1. Load data from CELLxGENE, DCP, URL or file

  2. Aggregate metadata

  3. Merge datasets per study, organ or custom subset

  4. Filter cells

Example configuration files are test/config.yaml and test/datasets.tsv. By default, dataset configurations are available under configs at the top level of the pipeline (the git root of this repository), but they can be modified or replaced by custom files.

Load data#

Files can either be read from a specified input file or downloaded directly from a URL, CELLxGENE or DCP. How a dataset is loaded depends on the dataset mapping.
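The dispatch can be pictured with the following minimal sketch. It assumes that the url column of the dataset definition file decides the loading route. The function name resolve_source and the rule that a remote scheme prefix marks a download are illustrative assumptions, not the pipeline's actual code.

```python
def resolve_source(row):
    """Return how a dataset row would be loaded: 'cellxgene', 'dcp', 'url' or 'file'."""
    url = (row.get("url") or "").strip()
    if url == "cellxgene":
        # a direct CELLxGENE download needs both identifiers
        if not (row.get("collection_id") and row.get("dataset_id")):
            raise ValueError("url=cellxgene requires collection_id and dataset_id")
        return "cellxgene"
    if url == "dcp":
        # a direct DCP download needs the project identifier
        if not row.get("project_uuid"):
            raise ValueError("url=dcp requires project_uuid")
        return "dcp"
    # Assumption: a remote scheme means "download", anything else is a local path
    if url.startswith(("http://", "https://", "ftp://")):
        return "url"
    return "file"
```

Each row of the dataset definition file would be passed through such a function once, before any data is read.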

Aggregate Metadata#

The metadata rule adds dataset-level information from the input TSV file. This step expects the data to follow the CELLxGENE schema 3.0.0 and extends it as described below. Other steps include:

  • adding external annotations if annotation file and columns are available in the TSV

  • saving donor IDs under .obs['donor']

  • inferring sample ID from input TSV and saving it under .obs['sample']

The input AnnData objects must contain:

  • .X raw counts, sparse format

  • .obs columns of the schema named in the schema column of the dataset definition file

  • .obs.index cell barcode

The output AnnData objects will contain:

  • .X raw counts, sparse format

  • .uns['meta'] metadata from TSV file

  • .obs columns from CELLxGENE schema 3.0.0, plus the subset of .uns['meta'] that is listed in EXTRA_COLUMNS in scripts/utils.py

  • .obs['dataset'], .uns['dataset'] name of task/dataset

  • .obs['organ'], .uns['organ'] organ

  • .obs['donor'] donor ID

  • .obs['sample'] sample ID (inferred from input TSV)

  • .obs['barcode'] cell barcodes as declared in index

  • .obs['author_annotation'] author annotation under the author_annotation column of the input TSV

  • .var gene information as specified in CELLxGENE schema 3.0.0

  • .obs.index unique cell identifiers e.g. dataset + numerical index

The AnnData is saved as a zarr file, which offers a better tradeoff between speed and compression than gzipped h5ad files.
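To illustrate why both .obs['barcode'] and a new .obs.index appear in the output, here is a small sketch of how the index could be made unique per dataset. The helper name make_unique_index and the '<dataset>-<position>' format are assumptions for illustration. The text above only states that the identifier is built from, for example, the dataset name and a numerical index.

```python
def make_unique_index(dataset, barcodes):
    """Return (new_index, barcode_column) for the cells of one dataset."""
    # keep the original barcodes so they can be stored under .obs['barcode']
    barcode_column = list(barcodes)
    # Assumption: '<dataset>-<position>' as identifier; the separator is illustrative
    new_index = [f"{dataset}-{i}" for i in range(len(barcode_column))]
    return new_index, barcode_column


index, barcode = make_unique_index("Lee2020_2", ["AAAC-1", "AAAG-1"])
print(index)  # ['Lee2020_2-0', 'Lee2020_2-1']
```

Because the dataset name is part of the identifier, cells from different datasets cannot collide when they are merged later, even if their barcodes are identical.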

Merge Data#

This operation is applied by the following rules:

  • merge_study: merge all datasets of a study if multiple datasets are available, else create a symlink

  • merge_organ: merge all studies that belong to an organ

  • merge_organ_filter: merge all cells removed by filtering per organ

  • merge_subset: merge datasets by subset defined in the input TSV under subset (overlapping subsets allowed)

The AnnData must contain all the slots described in Aggregate Metadata apart from the .uns slot.
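The grouping behind these rules can be sketched as follows. This is a simplification that only collects dataset names per group and does not merge any AnnData objects. The function name group_datasets, the example rows and the comma-separated encoding of multiple subsets are assumptions. The text only states that overlapping subsets are allowed.

```python
from collections import defaultdict


def group_datasets(rows, key):
    """Map each value of `key` (e.g. 'study', 'organ', 'subset') to its dataset names."""
    groups = defaultdict(list)
    for row in rows:
        # Assumption: several subsets per dataset are written as a comma-separated list
        for value in (row.get(key) or "").split(","):
            value = value.strip()
            if value:
                groups[value].append(row["dataset"])
    return dict(groups)


# Hypothetical rows of a dataset definition file
rows = [
    {"dataset": "A_1", "study": "A", "organ": "blood", "subset": "immune,covid"},
    {"dataset": "A_2", "study": "A", "organ": "blood", "subset": "immune"},
    {"dataset": "B_1", "study": "B", "organ": "blood", "subset": "covid"},
]
print(group_datasets(rows, "study"))   # {'A': ['A_1', 'A_2'], 'B': ['B_1']}
print(group_datasets(rows, "subset"))  # {'immune': ['A_1', 'A_2'], 'covid': ['A_1', 'B_1']}
```

A group with a single member, such as study B here, corresponds to the case where merge_study creates a symlink instead of merging.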

Filter#

Cells are filtered per study according to the config.yaml specification. Two keys control the filtering behaviour: filter_per_organ specifies global filter parameters for all datasets of an organ, and filter_per_study allows study-specific filter options. An example of an organ-level filter specification is shown below:

filter_per_organ:
  blood:
    cells_per_sample:
      min: 50
      max: 10000
    mito_pct: 30
    remove_by_column:
      dataset:
        - Lee2020_2

All organ-level filtering decisions are applied per study. The remove_by_column key can include any column that is available in the AnnData object of each study.

Study-level filters overwrite the organ-level settings and only need to be defined for the studies that require further filtering, as in the example below. The filter options are the same as for the organ-level filters.

filter_per_study:
  SchulteSchrepping2020:
    remove_by_column:
      sample:
        - Schulte-Schrepping_C2P01H_d0
        - Schulte-Schrepping_C2P05F_d0
        - Schulte-Schrepping_C2P07H_d0
        - Schulte-Schrepping_C2P10H_d0
        - Schulte-Schrepping_C2P13F_d0
        - Schulte-Schrepping_C2P15H_d0
        - Schulte-Schrepping_C2P16H_d0
        - Schulte-Schrepping_C2P19H_d0
      donor:
        - C19-CB-0008
      disease:
        - influenza

Both keys can be empty or missing from the config file. In that case, no filtering is applied.
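The following sketch shows one way the two keys could interact. It is not the pipeline's filter script. The function names are hypothetical, cells are represented as plain dictionaries instead of an AnnData object, and three behaviours are assumed: study-level keys replace organ-level keys of the same name, mito_pct is an upper bound on the mitochondrial percentage, and cells_per_sample is checked on the unfiltered cell counts.

```python
from collections import Counter


def resolve_filters(config, organ, study):
    """Combine organ-level and study-level filter settings for one study."""
    filters = dict((config.get("filter_per_organ") or {}).get(organ) or {})
    # Assumption: study-level keys replace organ-level keys of the same name
    filters.update((config.get("filter_per_study") or {}).get(study) or {})
    return filters


def filter_cells(cells, filters):
    """Return the cells (dicts of .obs-like values) that pass all filters."""
    per_sample = Counter(cell["sample"] for cell in cells)
    limits = filters.get("cells_per_sample", {})
    removed = filters.get("remove_by_column", {})
    kept = []
    for cell in cells:
        n_cells = per_sample[cell["sample"]]
        # drop cells from samples that are too small or too large
        if n_cells < limits.get("min", 0) or n_cells > limits.get("max", float("inf")):
            continue
        # Assumption: mito_pct is an upper bound on the mitochondrial percentage
        if "mito_pct" in filters and cell.get("mito_pct", 0) > filters["mito_pct"]:
            continue
        # drop cells whose value in any listed column is on the removal list
        if any(cell.get(column) in values for column, values in removed.items()):
            continue
        kept.append(cell)
    return kept
```

With an empty or missing configuration, resolve_filters returns an empty dictionary and filter_cells keeps every cell, which matches the behaviour described above.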

Preparing the input data#

In order to use the data loader module, you need to define the following files with the file locations and dataset-level metadata.

  • Dataset definition file (e.g. configs/datasets.tsv)

  • Schema mapping (e.g. configs/schema_mapping.tsv)

  • Configuration file (e.g. configs/imported/config.yaml)

  • DCP metadata (optional)

Additionally, you need to prepare your input AnnData files to contain the metadata that is defined in your dataset definition file.

Dataset definition file#

The dataset definition file should specify which datasets you want to include for your analyses together with any additional dataset-level metadata.

Column             Description
dataset            Name of the dataset. Multiple datasets make up a study.
study              Name of the study, used for aggregating datasets to study level.
organ              Name of the organ. Can be a tissue or any other name used for aggregating the atlas.
donor_column       Column with donor IDs. A donor is an individual who provided the sample.
sample_column      Column with sample IDs. Ideally, a sample is a subset of cells associated with one individual. The data should be deconvoluted, i.e. a sample must not contain cells from multiple individuals.
author_annotation  Column with author annotations (needed for annotation quality assessment and label harmonisation).
cell_type          Optional. Column with cell ontology labels (needed for different versions of the CELLxGENE schema, TODO: deprecate). If cell_type is missing from the file, it will be generated from author_annotation.
schema             Name of the schema to be mapped to cellxgene. The name must match a column in the schema mapping.
url                URL or path to the h5ad file. Set to cellxgene for CELLxGENE data (requires collection_id and dataset_id) or to dcp for HCA DCP data (requires project_uuid).
collection_id      Only for url=cellxgene, if data should be downloaded directly from CELLxGENE.
dataset_id         Only for url=cellxgene, if data should be downloaded directly from CELLxGENE.
project_uuid       Only for url=dcp, if data should be downloaded directly from the HCA DCP data portal.
annotation_file    Optional. File with additional annotations that are not in the AnnData object. Needs to have a matching barcode column.
barcode_column     Optional. Column in AnnData.obs used for merging external annotations.

All other columns are optional and will be added to AnnData.uns['meta'].
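As a minimal sketch of how such a file could be read, the snippet below separates the documented columns from any extra ones, which are the columns that end up in AnnData.uns['meta'] according to the sentence above. The helper name split_dataset_row, the study name, the file path and the extra column publication_year are made up for illustration.

```python
import csv
import io

# Columns documented in the table above
KNOWN_COLUMNS = {
    "dataset", "study", "organ", "donor_column", "sample_column",
    "author_annotation", "cell_type", "schema", "url", "collection_id",
    "dataset_id", "project_uuid", "annotation_file", "barcode_column",
}


def split_dataset_row(row):
    """Split one TSV row into documented columns and extra metadata."""
    known = {k: v for k, v in row.items() if k in KNOWN_COLUMNS}
    extra = {k: v for k, v in row.items() if k not in KNOWN_COLUMNS}
    return known, extra


# Hypothetical definition file with one extra column, 'publication_year'
tsv = (
    "dataset\tstudy\torgan\turl\tpublication_year\n"
    "Lee2020_2\tLee2020\tblood\tdata/lee.h5ad\t2020\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
known, extra = split_dataset_row(rows[0])
print(extra)  # {'publication_year': '2020'}
```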

Schema Mapping#

The data loader ensures that the data adheres to the CELLxGENE schema 3.0.0 specifications. For datasets that do not adhere to that schema, the schema mapping file lets you map custom AnnData.obs columns to the ones defined in CELLxGENE. Below is an example of a schema mapping from the schemas custom and dcp to cellxgene. The column names of the mapping file are matched against the schema column of the dataset definition file to select the mapping for each dataset.

cellxgene                custom                       dcp
study                    study_PI                     project.contributors.name
sample                   sample_ID                    specimen_from_organism.biomaterial_core.biomaterial_id
donor_id                 subject_ID                   donor_organism.biomaterial_core.biomaterial_id
development_stage        subject_developmental_state  donor_organism.development_stage.text
sex                      sex                          donor_organism.sex
self_reported_ethnicity  ethnicity_free_text          donor_organism.human_specific.ethnicity.text
suspension_type          biological_unit              library_preparation_protocol.nucleic_acid_source
assay                    library_platform             library_preparation_protocol.library_construction_method.text
organism                 species                      donor_organism.genus_species.text
cell_type                cell_type
...                      ...                          ...
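The renaming implied by this table can be sketched as follows, assuming a tab-separated file with a header row as in the example above. The function names load_schema_mapping and rename_columns are hypothetical, and the column n_genes stands in for any column that is not part of the mapping.

```python
import csv
import io


def load_schema_mapping(tsv_text, schema):
    """Return {source column: cellxgene column} for one schema of the mapping file."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        source = (row.get(schema) or "").strip()
        # skip rows where the schema has no counterpart for a CELLxGENE column
        if source:
            mapping[source] = row["cellxgene"]
    return mapping


def rename_columns(obs_columns, mapping):
    """Rename the columns that appear in the mapping and leave the others untouched."""
    return [mapping.get(column, column) for column in obs_columns]


# Three rows taken from the example mapping above
mapping_tsv = (
    "cellxgene\tcustom\tdcp\n"
    "donor_id\tsubject_ID\tdonor_organism.biomaterial_core.biomaterial_id\n"
    "sex\tsex\tdonor_organism.sex\n"
    "cell_type\tcell_type\t\n"
)
mapping = load_schema_mapping(mapping_tsv, "custom")
print(rename_columns(["subject_ID", "sex", "n_genes"], mapping))
# ['donor_id', 'sex', 'n_genes']
```

Empty cells, such as the missing dcp entry for cell_type, are skipped so that no column is renamed from an empty name.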

DCP metadata (optional)#

This file is optional and used for datasets for which users want to map additional HCA DCP metadata annotations. The mapping should contain a study and a filename column, where filename is a TSV file that follows the DCP metadata schema. The mapping does not have to include all the studies that you want to include in your analysis.

TODO: extend to other metadata input.

Configuration file#

The config.yaml file is the main configuration file of the pipeline and is included in the top-level workflow by default. For data loading, you only need to define the locations of the files described above. By default, configs/imported/config.yaml should already contain the correct paths to these input files. If you configured your datasets with files other than those listed below, update the paths accordingly.

dataset_meta: configs/datasets.tsv
schema_file: data/input/schema_mapping.tsv
dcp_metadata: configs/dcp_metadata.tsv

filter_per_organ:
...

filter_per_study:
...

For more information on Snakemake configuration files, please refer to the Snakemake documentation.

Testing#

The test configuration and commands are under test/. All of the following paths are relative to the module root directory.

Prepare test data#

Before running the test pipeline for the first time, you must download a test dataset. The following script downloads the “SchulteSchrepping” dataset and then copies it to the location that is defined in test/datasets.tsv.

bash test/download_test_data.sh -c1

This needs to be done only once.

Run pipeline on test configuration#

Activate the snakemake environment and call test/run_test_*.sh with run-specific Snakemake parameters.

conda activate snakemake
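# dry run: list the jobs without executing them
bash test/run_test_cellxgene.sh -n
# execute the workflow; -c without a number uses all available cores
bash test/run_test_cellxgene.sh -c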