SampleCollection

Data export, analysis and visualization functions are all contained within the SampleCollection class.

The One Codex API returns a SampleCollection whenever a model query yields multiple Samples.

Usage

SampleCollection provides tools for data export, analysis, visualization, and statistics; the sections below document each method. Querying for multiple samples returns a SampleCollection:

import onecodex

ocx = onecodex.Api()

project = ocx.Project.get("d53ad03b010542e3")
samples = ocx.Samples.where(project=project)

type(samples) # SampleCollection

A SampleCollection can also be created manually from a list of samples:

from onecodex.models.collection import SampleCollection

sample_list = [
    ocx.Samples.get("cee3b512605a43c6"),
    ocx.Samples.get("01f703ac505e4a30")
]

samples = SampleCollection(sample_list)

# convert classification results to a Pandas DataFrame
samples.to_df()

filter

SampleCollection.filter(filter_func)

Return a new SampleCollection containing only samples meeting the filter criteria.

Any kwargs (e.g., metric or skip_missing) used when instantiating the current collection are passed on to the new SampleCollection that is returned.

Parameters

filter_func : callable

A function that will be evaluated on every object in the collection. The function must return a bool. If True, the object will be kept. If False, it will be removed from the SampleCollection that is returned.

Returns

A onecodex.models.SampleCollection containing only the objects for which filter_func returned True.

Examples

Generate a new collection of Samples that have a specific filename extension:

>>> new_collection = samples.filter(lambda s: s.filename.endswith('.fastq.gz'))
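The predicate does not have to be a lambda; a named function works the same way and is easier to test in isolation. The sketch below uses types.SimpleNamespace objects as hypothetical stand-ins for Samples so that it runs without API access. With a real collection, the equivalent call would be samples.filter(is_gzipped_fastq).

```python
from types import SimpleNamespace


def is_gzipped_fastq(sample):
    """Return True for samples whose filename ends in .fastq.gz."""
    return sample.filename.endswith(".fastq.gz")


# Hypothetical stand-ins for Sample objects; only `filename` is modelled.
fake_samples = [
    SimpleNamespace(filename="a.fastq.gz"),
    SimpleNamespace(filename="b.fasta"),
    SimpleNamespace(filename="c.fastq.gz"),
]

# Mirrors the keep-if-True behaviour described for filter_func.
kept = [s for s in fake_samples if is_gzipped_fastq(s)]
kept_names = [s.filename for s in kept]
```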

to_otu

SampleCollection.to_otu(biom_id: str | None = None, include_ranks: tuple[str] = ('superkingdom', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'), metric: Metric = auto)

Generate a BIOM-formatted data structure.

Parameters

biom_id : string, optional

Optionally specify an id field for the generated v1 BIOM file.

include_ranks : list

A list of ranks to include in the taxonomy/OTU table. Uses onecodex.models.collection.CANONICAL_RANKS by default.

Returns

otu_table : OrderedDict

A BIOM OTU table, returned as a Python OrderedDict (can be dumped to JSON).
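Because the return value is an OrderedDict, it can be serialized with the standard json module. The following is a minimal sketch: the otu_table here is a placeholder with made-up keys and empty values, standing in for the result of samples.to_otu(), so the snippet runs without API access.

```python
import json
from collections import OrderedDict

# Placeholder standing in for the result of samples.to_otu(); the keys and
# values here are illustrative only, not the real table contents.
otu_table = OrderedDict(
    [("id", "example-biom"), ("rows", []), ("columns", []), ("data", [])]
)

# An OrderedDict serializes like a regular dict, with key order preserved.
biom_json = json.dumps(otu_table, indent=2)

# Round-trip to confirm the serialized text is valid JSON.
restored = json.loads(biom_json, object_pairs_hook=OrderedDict)
```

To write the table to disk instead, json.dump(otu_table, file_handle) takes an open file object in place of returning a string.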

to_df

SampleCollection.to_df(analysis_type: str | AnalysisType = 'classification', **kwargs) -> pd.DataFrame

Transform Analyses of samples in a SampleCollection into a tabular DataFrame.

Parameters

analysis_type : {'classification', 'functional'}, default='classification'

The type of analysis to aggregate.

**kwargs

Keyword arguments passed to the specific aggregation method.

Common Arguments:

  • metric (str | Metric): The metric to aggregate (default: Metric.Auto).

  • fill_missing (bool): Whether to fill missing values (default: True).

  • filler (Any): Value to use for filling missing values (default: 0).

If analysis_type='classification':

  • rank (Rank | str): Taxonomic rank to aggregate at (default: Rank.Auto).

  • top_n (int, optional): Return only the top N taxa by abundance.

  • threshold (float, optional): Filter taxa below this abundance threshold.

  • remove_zeros (bool): Remove taxa with zero abundance (default: True).

  • include_host (bool): Include host reads in the output (default: False).

  • table_format ({'wide', 'long'}): The shape of the output DataFrame.

  • include_taxa_missing_rank (bool): Include taxa unspecified at the target rank.

If analysis_type='functional':

  • annotation (str | FunctionalAnnotations): The functional annotation database (default: Pathways).

  • taxa_stratified (bool): Whether to include taxonomic stratification (default: True).

Returns

pd.DataFrame

A DataFrame containing the aggregated classification or functional results.

See Also

to_classification_df : Underlying method for classification extraction.

to_functional_df : Underlying method for functional extraction.
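As a sketch of how the keyword arguments listed above are forwarded, the hypothetical helper below (not part of the onecodex package) wraps to_df to request a classification table limited to the most abundant genera. It assumes that the string "genus" is accepted for rank, which the signature (Rank | str) suggests but this page does not state explicitly.

```python
def top_genera_table(samples, n=10):
    """Return a classification table limited to the n most abundant genera.

    `samples` is expected to be a SampleCollection; `top_genera_table` itself
    is a hypothetical helper, not part of the onecodex package.
    """
    return samples.to_df(
        analysis_type="classification",
        rank="genus",  # assumption: the string "genus" is a valid rank
        top_n=n,
    )
```

Keeping the call in a small function makes the chosen arguments easy to reuse across projects and to check against a stub object in tests.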

to_classification_df

SampleCollection.to_classification_df(rank: Rank | str = auto, top_n: int | None = None, threshold: float | None = None, remove_zeros: bool = True, include_host: bool = False, table_format: Literal['wide', 'long'] = 'wide', include_taxa_missing_rank: bool = False, fill_missing: bool = True, filler: Any = 0, metric: Metric = auto)

Generate a ClassificationsDataFrame, performing any specified transformations.

Takes the ClassificationsDataFrame associated with these samples (or this SampleCollection), applies the requested filtering, and returns a copy as a new ClassificationsDataFrame.

Parameters

rank : Rank, optional

Analysis will be restricted to abundances of taxa at the specified level. See Rank for details.

top_n : integer, optional

Return only the top N most abundant taxa.

metric : Metric, optional

The taxonomic abundance metric to use. See Metric for definitions.

threshold : float, optional

Return only taxa more abundant than this threshold in one or more samples.

remove_zeros : bool, optional

Do not return taxa that have zero abundance in every sample.

include_host : bool, optional

Include host reads in the analysis.

table_format : {'long', 'wide'}, optional

If wide, rows are classifications, columns are taxa, and elements are counts. If long, each row is one observation with three columns: classification_id, tax_id, and count.

include_taxa_missing_rank : bool, optional

Whether or not to include taxa that do not have a designated parent at rank (will be grouped into a “No <rank>” column).

fill_missing : bool, optional

Fill np.nan values.

filler : float, optional

Value with which to fill np.nans.

Returns

ClassificationsDataFrame
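The difference between the wide and long layouts described under table_format can be illustrated with plain Python data. This is only a sketch of the two shapes, using made-up classification IDs, taxon IDs, and counts; it is not how the library builds its DataFrames.

```python
# Wide layout: one row per classification, one column per taxon.
# The IDs and counts below are made up for illustration.
wide = {
    "classification_1": {"tax_a": 10, "tax_b": 0},
    "classification_2": {"tax_a": 3, "tax_b": 7},
}

# Long layout: one row per (classification, taxon) observation,
# with the three columns named in the table_format description.
long_rows = [
    {"classification_id": cid, "tax_id": tax_id, "count": count}
    for cid, taxa in wide.items()
    for tax_id, count in taxa.items()
]
```

The wide form is convenient for matrix-style operations across taxa, while the long form has one observation per row, which suits grouping and plotting tools.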

to_functional_df

SampleCollection.to_functional_df(annotation: FunctionalAnnotations = pathways, taxa_stratified: bool = True, metric: FunctionalAnnotationsMetric = coverage, fill_missing: bool = True, filler: Any = 0)

Generate a FunctionalDataFrame associated with functional analysis results.

Parameters

annotation : {onecodex.lib.enum.FunctionalAnnotations, str}, optional

Annotation data to return, defaults to pathways.

taxa_stratified : bool, optional

Return taxonomically stratified data, defaults to True.

metric : {onecodex.lib.enum.FunctionalAnnotationsMetric, str}, optional

Metric values to return: {'coverage', 'abundance'} when annotation == FunctionalAnnotations.Pathways, or {'rpk', 'cpm'} for other annotations. Defaults to coverage.

fill_missing : bool, optional

Fill np.nan values

filler : float, optional

Value with which to fill np.nans
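The metric values permitted for each annotation can be checked before calling to_functional_df. The helper functions below are hypothetical (not part of the onecodex package) and assume that annotations and metrics are passed as the lowercase strings shown in the parameter description; the library may perform its own validation.

```python
def allowed_functional_metrics(annotation):
    """Return the metric names documented for a given annotation name."""
    if annotation == "pathways":
        return {"coverage", "abundance"}
    # Every non-pathways annotation is documented as using rpk or cpm.
    return {"rpk", "cpm"}


def check_metric(annotation, metric):
    """Return metric unchanged, or raise ValueError if it is not documented."""
    allowed = allowed_functional_metrics(annotation)
    if metric not in allowed:
        raise ValueError(
            f"metric {metric!r} is not valid for annotation {annotation!r}; "
            f"expected one of {sorted(allowed)}"
        )
    return metric
```

A call such as check_metric("pathways", "coverage") passes the metric through, while check_metric("pathways", "rpk") raises ValueError.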