Filter#

Filtering#

This workflow filters the input data based on a set of conditions. The conditions can be defined in the config under the filter module.

Configuration#

DATASETS:
  Lee2020:
    input:
      filter: test/input/load_data/harmonize_metadata/Lee2020.zarr
    filter:
      subset: true ## whether to subset the file according to the filters
      remove_by_column:
        donor: # column name
          # entries from column to exclude (will be treated as String)
          - C19-CB-0008 # value not in data
          - Normal 1
          - nCoV 1
          - Flu 1
      keep_by_column:
        sex:
          - female
      remove_by_query:
        # pandas query strings for complex filtering conditions
        - 'random < 3'
      keep_by_query:
        # pandas query strings for cells to keep
        - 'disease == "influenza"'
  test:
    input:
      filter: test/input/pbmc68k.h5ad
    filter:
      subset: true
      remove_by_column:
        phase:
          - G1
        is_cd14_mono:
          - true

Parameters#

  • subset: Boolean flag that determines whether to physically subset the file according to the filters

    • true (default): Creates a new subset file with filtered data

    • false: Only applies filters without modifying the original file structure

  • write_copy: Boolean flag for output format when subsetting

    • true: Writes a full copy of the data

    • false (default): Writes linked zarr format when possible

  • remove_by_column: Dictionary defining filtering conditions

    • Keys: Column names in the AnnData observation metadata (adata.obs)

    • Values: List of entries to exclude from that column

    • All entries are treated as strings for comparison

  • keep_by_column: Dictionary defining columns and entries to retain

    • Keys: Column names in the AnnData observation metadata (adata.obs)

    • Values: List of entries to keep from that column

    • All entries are treated as strings for comparison

  • remove_by_query: List of pandas query strings for complex filtering conditions

    • Each query string follows pandas DataFrame query syntax

    • Cells matching these queries will be removed

  • keep_by_query: List of pandas query strings for cells to retain

    • Each query string follows pandas DataFrame query syntax

    • Only cells matching these queries will be kept

Behavior#

  • All remove_by filtering conditions are combined with AND logic - a cell must pass ALL filters to be retained

    • Within remove_by_column: cells are excluded if they match ANY value in ANY specified column

    • Within remove_by_query: cells are excluded if they match ANY of the query conditions

  • All keep_by filtering conditions are combined with OR logic - a cell only needs to pass ONE filter to be retained

    • Within keep_by_column: cells are retained if they match ANY value in ANY specified column

    • Within keep_by_query: cells are retained only if they match ANY of the query conditions

  • The final mask keeps cells that:

    1. If keep_by filters are specified: DO match at least ONE condition from keep_by_column OR keep_by_query, AND

    2. Do NOT match any values in remove_by_column columns, AND

    3. Do NOT match any remove_by_query conditions.

    If no keep_by filters are specified, all cells are initially kept and only remove_by filters are applied.

Input/Output#

  • Input: AnnData files in .zarr or .h5ad format

  • Output: Filtered AnnData file (subset if subset: true, otherwise original structure with filter metadata)

  • A filtered column is added to adata.obs indicating which cells passed the filters