Filter#
Filtering#
This workflow filters the input data based on a set of conditions.
The conditions can be defined in the config under the filter module.
Configuration#
DATASETS:
  Lee2020:
    input:
      filter: test/input/load_data/harmonize_metadata/Lee2020.zarr
    filter:
      subset: true  # whether to subset the file according to the filters
      remove_by_column:
        donor:  # column name
          # entries from column to exclude (will be treated as String)
          - C19-CB-0008  # value not in data
          - Normal 1
          - nCoV 1
          - Flu 1
      keep_by_column:
        sex:
          - female
      remove_by_query:
        # pandas query strings for complex filtering conditions
        - 'random < 3'
      keep_by_query:
        # pandas query strings for cells to keep
        - 'disease == "influenza"'
  test:
    input:
      filter: test/input/pbmc68k.h5ad
    filter:
      subset: true
      remove_by_column:
        phase:
          - G1
        is_cd14_mono:
          - true
Parameters#
- subset: Boolean flag that determines whether to physically subset the file according to the filters
  - true (default): Creates a new subset file with filtered data
  - false: Only applies filters without modifying the original file structure
- write_copy: Boolean flag for output format when subsetting
  - true: Writes a full copy of the data
  - false (default): Writes linked zarr format when possible
- remove_by_column: Dictionary defining filtering conditions
  - Keys: Column names in the AnnData observation metadata (adata.obs)
  - Values: List of entries to exclude from that column
  - All entries are treated as strings for comparison
- keep_by_column: Dictionary defining columns and entries to retain
  - Keys: Column names in the AnnData observation metadata (adata.obs)
  - Values: List of entries to keep from that column
  - All entries are treated as strings for comparison
- remove_by_query: List of pandas query strings for complex filtering conditions
  - Each query string follows pandas DataFrame query syntax
  - Cells matching these queries will be removed
- keep_by_query: List of pandas query strings for cells to retain
  - Each query string follows pandas DataFrame query syntax
  - Only cells matching these queries will be kept
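The two kinds of condition described above can be illustrated with plain pandas. This is a minimal sketch, not the workflow's actual implementation: the toy `obs` table stands in for `adata.obs`, and the variable names `hit_by_column` and `hit_by_query` are made up for illustration. It assumes that column entries are compared after casting both sides to strings and that a query "matches" the rows returned by `DataFrame.query`.

```python
import pandas as pd

# Toy stand-in for adata.obs; column names mirror the Lee2020 example config.
obs = pd.DataFrame(
    {
        "donor": ["Normal 1", "Flu 1", "Flu 2", "nCoV 2"],
        "random": [1, 5, 2, 7],
    },
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)

# remove_by_column style condition: cast both sides to str before comparing
# (assumption based on "All entries are treated as strings for comparison").
remove_values = ["C19-CB-0008", "Normal 1", "nCoV 1", "Flu 1"]
hit_by_column = obs["donor"].astype(str).isin([str(v) for v in remove_values])

# remove_by_query style condition: the rows returned by DataFrame.query
# are the cells that match the query string.
hit_by_query = obs.index.isin(obs.query("random < 3").index)

print(hit_by_column.tolist())
print(hit_by_query.tolist())
```

Note that a configured value that never occurs in the column (here `C19-CB-0008`) simply matches nothing in this sketch.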
Behavior#
- All remove_by filtering conditions are combined with AND logic: a cell must pass ALL filters to be retained
  - Within remove_by_column: cells are excluded if they match ANY value in ANY specified column
  - Within remove_by_query: cells are excluded if they match ANY of the query conditions
- All keep_by filtering conditions are combined with OR logic: a cell only needs to pass ONE filter to be retained
  - Within keep_by_column: cells are retained if they match ANY value in ANY specified column
  - Within keep_by_query: cells are retained if they match ANY of the query conditions
The final mask keeps cells that:
- If keep_by filters are specified: DO match at least ONE condition from keep_by_column OR keep_by_query, AND
- Do NOT match any values in remove_by_column columns, AND
- Do NOT match any remove_by_query conditions.
If no keep_by filters are specified, all cells are initially kept and only remove_by filters are applied.
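The combination rules above can be sketched as a small pandas function. This is an illustrative reading of the described behavior under stated assumptions, not the workflow's own code: `build_filter_mask` is a hypothetical helper, and the toy `obs` table stands in for `adata.obs`.

```python
import pandas as pd


def build_filter_mask(obs, remove_by_column=None, keep_by_column=None,
                      remove_by_query=None, keep_by_query=None):
    """Hypothetical helper: boolean Series, True for cells to retain."""
    remove_by_column = remove_by_column or {}
    keep_by_column = keep_by_column or {}
    remove_by_query = remove_by_query or []
    keep_by_query = keep_by_query or []

    def column_hits(column, values):
        # Entries are compared as strings.
        return obs[column].astype(str).isin([str(v) for v in values])

    def query_hits(query):
        # Cells returned by DataFrame.query count as matching.
        return pd.Series(obs.index.isin(obs.query(query).index), index=obs.index)

    if keep_by_column or keep_by_query:
        # OR logic: one matching keep condition is enough.
        keep = pd.Series(False, index=obs.index)
        for column, values in keep_by_column.items():
            keep |= column_hits(column, values)
        for query in keep_by_query:
            keep |= query_hits(query)
    else:
        # No keep_by filters: every cell is initially kept.
        keep = pd.Series(True, index=obs.index)

    # Any matching remove condition excludes the cell.
    remove = pd.Series(False, index=obs.index)
    for column, values in remove_by_column.items():
        remove |= column_hits(column, values)
    for query in remove_by_query:
        remove |= query_hits(query)

    return keep & ~remove


obs = pd.DataFrame(
    {
        "donor": ["Normal 1", "D2", "D3", "D4"],
        "sex": ["female", "female", "male", "male"],
        "disease": ["healthy", "healthy", "influenza", "healthy"],
        "random": [5, 1, 9, 9],
    },
    index=["c1", "c2", "c3", "c4"],
)

mask = build_filter_mask(
    obs,
    remove_by_column={"donor": ["Normal 1"]},
    keep_by_column={"sex": ["female"]},
    remove_by_query=["random < 3"],
    keep_by_query=['disease == "influenza"'],
)
mask_remove_only = build_filter_mask(
    obs,
    remove_by_column={"donor": ["Normal 1"]},
    remove_by_query=["random < 3"],
)
```

In this toy data, `c3` is the only cell retained by `mask`: it matches a keep query and no remove condition. Without keep filters, `c4` is retained as well.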
Input/Output#
- Input: AnnData files in .zarr or .h5ad format
- Output: Filtered AnnData file (subset if subset: true, otherwise original structure with filter metadata)
- A filtered column is added to adata.obs indicating which cells passed the filters
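The difference between annotating and subsetting can be pictured with a pandas table standing in for `adata.obs`. This is a rough sketch only: it assumes `filtered` is a boolean column where `True` marks cells that passed, and the variable names `annotated` and `subsetted` are illustrative rather than taken from the workflow.

```python
import pandas as pd

# Toy stand-in for adata.obs, loosely following the "test" dataset config.
obs = pd.DataFrame({"phase": ["G1", "S", "G2M"]}, index=["c1", "c2", "c3"])

# Cells that pass a remove_by_column condition of phase: [G1].
mask = ~obs["phase"].astype(str).isin(["G1"])

# Analogue of subset: false. All cells remain, with the filter result recorded.
annotated = obs.assign(filtered=mask)

# Analogue of subset: true. Only cells that passed the filters remain.
subsetted = annotated[annotated["filtered"]]

print(annotated)
print(subsetted)
```

With a real AnnData object, the same boolean column could be used for row indexing, for example `adata[adata.obs["filtered"]]`, again assuming the column is boolean.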