SampleCollection

Data export, analysis and visualization functions are all contained within the SampleCollection class.

A SampleCollection is returned whenever a model query made through the One Codex API yields multiple Samples.

Usage

SampleCollection provides tools for data export, analysis, visualization and statistics, each documented in the sections below. Querying for multiple samples returns a SampleCollection:

import onecodex

ocx = onecodex.Api()

project = ocx.Project.get("d53ad03b010542e3")
samples = ocx.Samples.where(project=project)

type(samples) # SampleCollection

A SampleCollection can also be created manually from a list of samples:

from onecodex.models.collection import SampleCollection

sample_list = [
    ocx.Samples.get("cee3b512605a43c6"),
    ocx.Samples.get("01f703ac505e4a30")
]

samples = SampleCollection(sample_list)

# Convert classification results to a pandas DataFrame
samples.to_df()

`filter`

SampleCollection.filter(filter_func)

Return a new SampleCollection containing only samples meeting the filter criteria.

Any kwargs (e.g., metric or skip_missing) used when instantiating the current collection are passed on to the new SampleCollection that is returned.

Parameters

filter_func (callable): A function that will be evaluated on every object in the collection. The function must return a bool. If True, the object will be kept. If False, it will be removed from the SampleCollection that is returned.

Returns

A onecodex.models.SampleCollection containing only the objects for which filter_func returned True.

Examples

Generate a new collection of Samples that have a specific filename extension:

>>> new_collection = samples.filter(lambda s: s.filename.endswith('.fastq.gz'))
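
For criteria that do not fit comfortably in a lambda, a named predicate works the same way. The sketch below is self-contained and does not touch the One Codex API: `SimpleNamespace` records with a `filename` attribute stand in for Samples, and a list comprehension mimics the keep-if-True behavior described above. The names `is_gzipped_fastq` and `fake_samples` are hypothetical.

```python
from types import SimpleNamespace


def is_gzipped_fastq(sample):
    """Return True for samples whose filename ends in .fastq.gz."""
    return sample.filename.endswith(".fastq.gz")


# Stand-ins for Samples; real objects would come from the One Codex API.
fake_samples = [
    SimpleNamespace(filename="gut_01.fastq.gz"),
    SimpleNamespace(filename="gut_02.fasta"),
    SimpleNamespace(filename="soil_01.fastq.gz"),
]

# Mimics samples.filter(is_gzipped_fastq): keep objects the predicate accepts.
kept = [s for s in fake_samples if is_gzipped_fastq(s)]
kept_names = [s.filename for s in kept]
print(kept_names)
```

With a real collection, the same predicate would be passed directly, as in `samples.filter(is_gzipped_fastq)`.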

`to_otu`

SampleCollection.to_otu(biom_id=None, include_ranks=('superkingdom', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'))

Generate a BIOM-formatted data structure.

Parameters

biom_id (string, optional): Optionally specify an id field for the generated v1 BIOM file.
include_ranks (list): A list of ranks to include in the taxonomy/OTU table. Uses onecodex.models.collection.CANONICAL_RANKS by default.

Returns

otu_table (OrderedDict): A BIOM OTU table, returned as a Python OrderedDict (can be dumped to JSON).
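
Because the return value is described as something that can be dumped to JSON, writing it to disk needs only the standard library. The following is a minimal sketch: `write_biom_json` is a hypothetical helper, and the small `OrderedDict` is a stand-in for a real `to_otu()` result, not a picture of its actual contents.

```python
import json
import os
import tempfile
from collections import OrderedDict


def write_biom_json(otu_table, path):
    """Serialize an OTU table (a dict-like object) to a JSON file."""
    with open(path, "w") as handle:
        json.dump(otu_table, handle, indent=2)


# Stand-in for the result of samples.to_otu(); the keys are illustrative only.
otu_table = OrderedDict(
    [("id", "example"), ("rows", []), ("columns", []), ("data", [])]
)

out_path = os.path.join(tempfile.mkdtemp(), "otu_table.biom")
write_biom_json(otu_table, out_path)

# Reading the file back gives an equivalent mapping.
with open(out_path) as handle:
    round_tripped = json.load(handle)
```

With a real collection, the call would presumably look like `write_biom_json(samples.to_otu(), "otu_table.biom")`.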

`to_df`

SampleCollection.to_df(analysis_type=AnalysisType.Classification, **kwargs)

Transform Analyses of samples in a SampleCollection into tabular format.

Parameters

analysis_type ({‘classification’, ‘functional’}, optional): The analysis_type to aggregate, corresponding to AnalysisJob.analysis_type.
kwargs (dict, optional): Keyword arguments specific to the analysis_type; see each individual function definition.
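
The parameter descriptions suggest that `to_df` selects a per-analysis method and forwards the keyword arguments to it. The sketch below illustrates that pattern only; it is an assumption about the general shape, not the library's implementation. `FakeCollection` is a hypothetical stand-in whose methods simply report how they were called, and analysis types are passed as plain strings.

```python
class FakeCollection:
    """Stand-in that records which method was called and with what kwargs."""

    def to_classification_df(self, **kwargs):
        return ("classification", kwargs)

    def to_functional_df(self, **kwargs):
        return ("functional", kwargs)

    def to_df(self, analysis_type="classification", **kwargs):
        # Pick the method matching analysis_type and forward the kwargs.
        handlers = {
            "classification": self.to_classification_df,
            "functional": self.to_functional_df,
        }
        if analysis_type not in handlers:
            raise ValueError(f"unsupported analysis_type: {analysis_type!r}")
        return handlers[analysis_type](**kwargs)


result = FakeCollection().to_df("functional", taxa_stratified=False)
```

Under this reading, keyword arguments such as `top_n` only make sense with the classification type, and `taxa_stratified` only with the functional type.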

`to_classification_df`

SampleCollection.to_classification_df(rank: Rank = Rank.Auto, top_n: int | None = None, threshold: float | None = None, remove_zeros: bool = True, normalize: Literal['auto'] | bool = 'auto', table_format: Literal['wide', 'long'] = 'wide', include_taxa_missing_rank: bool = False, fill_missing: bool = True, filler: Any = 0)

Generate a ClassificationsDataFrame, performing any specified transformations.

Takes the ClassificationsDataFrame associated with the samples in this SampleCollection, applies the requested filtering, and returns the result as a new ClassificationsDataFrame copy.

Parameters

rank ({‘auto’, ‘superkingdom’, ‘kingdom’, ‘phylum’, ‘class’, ‘order’, ‘family’, ‘genus’, ‘species’}, optional): Analysis will be restricted to abundances of taxa at the specified level.
top_n (integer, optional): Return only the top N most abundant taxa.
threshold (float, optional): Return only taxa more abundant than this threshold in one or more samples.
remove_zeros (bool, optional): Do not return taxa that have zero abundance in every sample.
normalize ({‘auto’, True, False}): Convert read counts to relative abundances (each sample sums to 1.0). If data has already been normalized, passing normalize=False will raise an error. To generate denormalized data, please create a new SampleCollection with metric="readcount" or metric="readcount_w_children".
table_format ({‘long’, ‘wide’}): If wide, rows are classifications, cols are taxa, elements are counts. If long, rows are observations with three cols each: classification_id, tax_id, and count.
include_taxa_missing_rank (bool, optional): Whether or not to include taxa that do not have a designated parent at rank (will be grouped into a “No <rank>” column).
fill_missing (bool, optional): Fill np.nan values.
filler (float, optional): Value with which to fill np.nans.

Returns

ClassificationsDataFrame
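
To make the normalize and table_format options concrete, here is a plain-Python sketch of the two transformations as they are described above. It does not use the One Codex library or pandas: the nested dict `wide_counts` and the helpers `normalize_rows` and `wide_to_long` are hypothetical stand-ins, and the IDs and counts are made up for illustration.

```python
def normalize_rows(wide):
    """Convert per-classification read counts to relative abundances."""
    normalized = {}
    for classification_id, counts in wide.items():
        total = sum(counts.values())
        normalized[classification_id] = {
            tax_id: (count / total if total else 0.0)
            for tax_id, count in counts.items()
        }
    return normalized


def wide_to_long(wide):
    """Flatten a wide table into (classification_id, tax_id, count) rows."""
    return [
        (classification_id, tax_id, count)
        for classification_id, counts in wide.items()
        for tax_id, count in counts.items()
    ]


# Wide layout: one row per classification, one column per taxon (made-up values).
wide_counts = {
    "classification_a": {"taxon_1": 30, "taxon_2": 10},
    "classification_b": {"taxon_1": 5, "taxon_2": 15},
}

relative = normalize_rows(wide_counts)
long_rows = wide_to_long(wide_counts)
```

After normalization each row sums to 1.0, and the long layout carries the same values as the wide one, one observation per row.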

`to_functional_df`

SampleCollection.to_functional_df(annotation: FunctionalAnnotations = FunctionalAnnotations.Pathways, taxa_stratified: bool = True, metric: FunctionalAnnotationsMetric = FunctionalAnnotationsMetric.Coverage, fill_missing: bool = True, filler: Any = 0)

Generate a FunctionalDataFrame associated with functional analysis results.

Parameters

annotation ({onecodex.lib.enum.FunctionalAnnotations, str}, optional): Annotation data to return, defaults to pathways.
taxa_stratified (bool, optional): Return taxonomically stratified data, defaults to True.
metric ({onecodex.lib.enum.FunctionalAnnotationsMetric, str}, optional): Metric values to return: {‘coverage’, ‘abundance’} for annotation==FunctionalAnnotations.Pathways or {‘rpk’, ‘cpm’} for other annotations. Defaults to coverage.
fill_missing (bool, optional): Fill np.nan values.
filler (float, optional): Value with which to fill np.nans.
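
The pairing of metric and annotation described above can be checked before a call is made. This sketch is an illustration under stated assumptions, not library code: `allowed_metrics` and `check_metric` are hypothetical helpers, annotation and metric values are passed as plain strings, and `"other_annotation"` is a placeholder for any non-pathways annotation.

```python
def allowed_metrics(annotation):
    """Return the metric names documented as valid for an annotation."""
    if annotation == "pathways":
        return {"coverage", "abundance"}
    # Every other annotation is documented as using rpk or cpm.
    return {"rpk", "cpm"}


def check_metric(annotation, metric):
    """Return the metric unchanged, or raise ValueError on a mismatch."""
    valid = allowed_metrics(annotation)
    if metric not in valid:
        raise ValueError(
            f"metric {metric!r} is not valid for annotation {annotation!r}; "
            f"expected one of {sorted(valid)}"
        )
    return metric


pathways_metric = check_metric("pathways", "coverage")
other_metric = check_metric("other_annotation", "cpm")
```

Note that the documented default metric, coverage, falls in the pathways set, so a non-pathways annotation likely needs an explicit metric argument.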
