SampleCollection
Data export, analysis and visualization functions are all contained within the
SampleCollection class.
A SampleCollection is returned whenever a query made through a One Codex API
model (for example, ocx.Samples.where) returns multiple Samples.
Usage
SampleCollection contains useful tools for data export, analysis,
visualization and statistics. Querying for multiple samples, such as all samples in a project, returns a SampleCollection:
import onecodex
ocx = onecodex.Api()
project = ocx.Project.get("d53ad03b010542e3")
samples = ocx.Samples.where(project=project)
type(samples) # SampleCollection
A SampleCollection can also be created manually from a list of samples:
from onecodex.models.collection import SampleCollection
sample_list = [
    ocx.Samples.get("cee3b512605a43c6"),
    ocx.Samples.get("01f703ac505e4a30"),
]
samples = SampleCollection(sample_list)
# convert classification results to a Pandas DataFrame
samples.to_df()
filter
- SampleCollection.filter(filter_func)
Return a new SampleCollection containing only samples meeting the filter criteria.
Any kwargs (e.g., metric or skip_missing) used when instantiating the current collection are passed on to the new SampleCollection that is returned.
Parameters
- filter_func : callable
A function that will be evaluated on every object in the collection. The function must return a bool. If True, the object will be kept. If False, it will be removed from the SampleCollection that is returned.
Returns
A onecodex.models.SampleCollection containing only the objects for which filter_func returned True.
Examples
Generate a new collection of Samples that have a specific filename extension:
>>> new_collection = samples.filter(lambda s: s.filename.endswith('.fastq.gz'))
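The filtering behavior described above can be sketched without the One Codex client. The following is a minimal illustration only: MiniCollection is a hypothetical stand-in, not the library's SampleCollection, and it assumes the collection simply stores its constructor kwargs and forwards them to the new collection.

```python
class MiniCollection:
    """Hypothetical stand-in for SampleCollection (not the onecodex class)."""

    def __init__(self, items, **kwargs):
        self.items = list(items)
        self.kwargs = kwargs  # e.g. metric or skip_missing

    def filter(self, filter_func):
        # Keep only items for which filter_func returns True, and forward
        # the kwargs this collection was created with to the new collection.
        kept = [item for item in self.items if filter_func(item)]
        return MiniCollection(kept, **self.kwargs)


# Plain filename strings stand in for Sample objects here.
filenames = MiniCollection(
    ["a.fastq.gz", "b.fasta", "c.fastq.gz"], skip_missing=True
)
gz_only = filenames.filter(lambda name: name.endswith(".fastq.gz"))

print(gz_only.items)   # ['a.fastq.gz', 'c.fastq.gz']
print(gz_only.kwargs)  # {'skip_missing': True}
```

The original collection is left untouched; filter always builds a new one.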
to_otu
- SampleCollection.to_otu(biom_id: str | None = None, include_ranks: tuple[str] = ('superkingdom', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'), metric: Metric = auto)
Generate a BIOM-formatted data structure.
Parameters
- biom_id : string, optional
Optionally specify an id field for the generated v1 BIOM file.
- include_ranks : list
A list of ranks to include in the taxonomy/OTU table. Uses onecodex.models.collection.CANONICAL_RANKS by default.
Returns
- otu_table : OrderedDict
A BIOM OTU table, returned as a Python OrderedDict (can be dumped to JSON)
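Because the return value is an OrderedDict, it can be serialized with the standard json module. The sketch below uses a small hand-built placeholder instead of a real to_otu() result; its keys and values are illustrative assumptions, not actual One Codex output.

```python
import json
from collections import OrderedDict

# Placeholder standing in for the value returned by samples.to_otu().
# The keys and values are illustrative only, not real output.
otu_table = OrderedDict(
    [
        ("id", "example_biom"),
        ("format", "Biological Observation Matrix 1.0.0"),
        ("type", "OTU table"),
        ("shape", [2, 2]),
        ("data", [[0, 0, 12], [1, 1, 7]]),
    ]
)

# Dump to a JSON string; json.dump(otu_table, file_handle) would write a file.
biom_json = json.dumps(otu_table, indent=2)

# Reading it back with object_pairs_hook keeps the original key order.
restored = json.loads(biom_json, object_pairs_hook=OrderedDict)
print(list(restored.keys()))  # ['id', 'format', 'type', 'shape', 'data']
```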
to_df
- SampleCollection.to_df(analysis_type: str | AnalysisType = 'classification', **kwargs) -> pd.DataFrame
Transform Analyses of samples in a SampleCollection into a tabular DataFrame.
Parameters
- analysis_type : {'classification', 'functional'}, default='classification'
The type of analysis to aggregate.
- **kwargs
Keyword arguments passed to the specific aggregation method.
Common Arguments:
metric (str | Metric): The metric to aggregate (default: Metric.Auto).
fill_missing (bool): Whether to fill missing values (default: True).
filler (Any): Value to use for filling missing values (default: 0).
If analysis_type='classification':
rank (Rank | str): Taxonomic rank to aggregate at (default: Rank.Auto).
top_n (int, optional): Return only the top N taxa by abundance.
threshold (float, optional): Filter taxa below this abundance threshold.
remove_zeros (bool): Remove taxa with zero abundance (default: True).
include_host (bool): Include host reads in the output (default: False).
table_format ({'wide', 'long'}): The shape of the output DataFrame.
include_taxa_missing_rank (bool): Include taxa unspecified at the target rank.
If analysis_type='functional':
annotation (str | FunctionalAnnotations): The functional annotation database (default: Pathways).
taxa_stratified (bool): Whether to include taxonomic stratification (default: True).
Returns
- pd.DataFrame
A DataFrame containing the aggregated classification or functional results.
See Also
to_classification_df : Underlying method for classification extraction.
to_functional_df : Underlying method for functional extraction.
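The relationship between to_df and the two underlying methods can be pictured as a simple dispatch on analysis_type. This is a hedged sketch of that idea, not the library's implementation: DispatchSketch and its method bodies are hypothetical, and the stand-in methods just echo the keyword arguments they receive.

```python
class DispatchSketch:
    """Hypothetical stand-in showing how to_df could route its arguments."""

    def to_classification_df(self, **kwargs):
        # Stand-in: the real method would return a DataFrame.
        return ("classification", kwargs)

    def to_functional_df(self, **kwargs):
        # Stand-in: the real method would return a DataFrame.
        return ("functional", kwargs)

    def to_df(self, analysis_type="classification", **kwargs):
        handlers = {
            "classification": self.to_classification_df,
            "functional": self.to_functional_df,
        }
        try:
            handler = handlers[str(analysis_type)]
        except KeyError:
            raise ValueError(
                f"unsupported analysis_type: {analysis_type!r}"
            ) from None
        # All remaining keyword arguments go to the chosen method unchanged.
        return handler(**kwargs)


sketch = DispatchSketch()
print(sketch.to_df(rank="genus", top_n=10))
# ('classification', {'rank': 'genus', 'top_n': 10})
print(sketch.to_df(analysis_type="functional", taxa_stratified=False))
# ('functional', {'taxa_stratified': False})
```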
to_classification_df
- SampleCollection.to_classification_df(rank: Rank | str = auto, top_n: int | None = None, threshold: float | None = None, remove_zeros: bool = True, include_host: bool = False, table_format: Literal['wide', 'long'] = 'wide', include_taxa_missing_rank: bool = False, fill_missing: bool = True, filler: Any = 0, metric: Metric = auto)
Generate a ClassificationsDataFrame, performing any specified transformations.
Takes the ClassificationsDataFrame associated with these samples (or with this SampleCollection), applies the requested filtering, and returns the result as a ClassificationsDataFrame copy.
Parameters
- rank : Rank, optional
Analysis will be restricted to abundances of taxa at the specified level. See Rank for details.
- top_n : integer, optional
Return only the top N most abundant taxa.
- metric : Metric, optional
The taxonomic abundance metric to use. See Metric for definitions.
- threshold : float, optional
Return only taxa more abundant than this threshold in one or more samples.
- remove_zeros : bool, optional
Do not return taxa that have zero abundance in every sample.
- include_host : bool, optional
Include host reads in the analysis.
- table_format : {'long', 'wide'}, optional
If wide, rows are classifications, cols are taxa, elements are counts. If long, rows are observations with three cols each: classification_id, tax_id, and count.
- include_taxa_missing_rank : bool, optional
Whether or not to include taxa that do not have a designated parent at rank (will be grouped into a “No <rank>” column).
- fill_missing : bool, optional
Fill np.nan values.
- filler : float, optional
Value with which to fill np.nans.
Returns
ClassificationsDataFrame
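To make the threshold, remove_zeros and table_format descriptions concrete, here is a small pure-Python sketch over plain dictionaries. It is an assumption-laden illustration, not the library's code: keep_taxa and to_long are hypothetical helpers, the taxon names and counts are made up, and the order in which the real method applies its filters is not specified here.

```python
# Wide format: one row per classification, one column per taxon.
# Identifiers and counts are made up for illustration.
wide = {
    "classification_a": {"taxon_1": 120, "taxon_2": 0, "taxon_3": 0},
    "classification_b": {"taxon_1": 30, "taxon_2": 45, "taxon_3": 0},
}
taxa = ["taxon_1", "taxon_2", "taxon_3"]


def keep_taxa(table, taxa, threshold=None, remove_zeros=True):
    """Hypothetical helper mirroring the documented filter descriptions."""
    kept = []
    for taxon in taxa:
        values = [row[taxon] for row in table.values()]
        if remove_zeros and all(value == 0 for value in values):
            continue  # zero abundance in every sample
        if threshold is not None and not any(v > threshold for v in values):
            continue  # not more abundant than the threshold in any sample
        kept.append(taxon)
    return kept


def to_long(table, taxa):
    """Hypothetical helper: one row per observation, three columns each."""
    return [
        {"classification_id": cid, "tax_id": taxon, "count": row[taxon]}
        for cid, row in table.items()
        for taxon in taxa
    ]


kept = keep_taxa(wide, taxa, threshold=40)
print(kept)  # ['taxon_1', 'taxon_2']

long_rows = to_long(wide, kept)
print(long_rows[0])
# {'classification_id': 'classification_a', 'tax_id': 'taxon_1', 'count': 120}
```

Note that taxon_2 survives the threshold because it exceeds 40 in one sample, even though it is zero in the other.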
to_functional_df
- SampleCollection.to_functional_df(annotation: FunctionalAnnotations = pathways, taxa_stratified: bool = True, metric: FunctionalAnnotationsMetric = coverage, fill_missing: bool = True, filler: Any = 0)
Generate a FunctionalDataFrame associated with functional analysis results.
Parameters
- annotation : {onecodex.lib.enum.FunctionalAnnotations, str}, optional
Annotation data to return. Defaults to pathways.
- taxa_stratified : bool, optional
Return taxonomically stratified data. Defaults to True.
- metric : {onecodex.lib.enum.FunctionalAnnotationsMetric, str}, optional
Metric values to return: {'coverage', 'abundance'} for annotation == FunctionalAnnotations.Pathways, or {'rpk', 'cpm'} for other annotations. Defaults to coverage.
- fill_missing : bool, optional
Fill np.nan values.
- filler : float, optional
Value with which to fill np.nans.
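The pairing of annotation and metric described above can be checked before making a call. The helper below is a hypothetical sketch based only on the parameter description: check_functional_metric is not part of the One Codex client, it works on plain strings rather than the library's enums, and "other_annotation" is a placeholder name for any non-pathways annotation.

```python
# Allowed metrics, taken from the parameter description above.
PATHWAY_METRICS = {"coverage", "abundance"}
OTHER_METRICS = {"rpk", "cpm"}


def check_functional_metric(annotation, metric):
    """Hypothetical helper: validate an annotation/metric combination."""
    allowed = PATHWAY_METRICS if annotation == "pathways" else OTHER_METRICS
    if metric not in allowed:
        raise ValueError(
            f"metric {metric!r} is not valid for annotation {annotation!r}; "
            f"expected one of {sorted(allowed)}"
        )
    return metric


print(check_functional_metric("pathways", "coverage"))      # coverage
print(check_functional_metric("other_annotation", "cpm"))   # cpm

try:
    check_functional_metric("pathways", "rpk")
except ValueError as error:
    print(error)
```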