Statistics
See Also
Statistics functions are implemented as part of SampleCollection. For
more information, see SampleCollection.
Tests
aitchison_distance
- SampleCollection.aitchison_distance(metric: Metric = auto, rank: Rank = auto) -> skbio.stats.distance.DistanceMatrix
Calculate the Aitchison distance between samples.
Aitchison distance is the Euclidean distance between centered log-ratio (CLR) transformed sample abundances. Because the CLR transform takes logarithms, zeros in the data must first be ‘estimated’, i.e. replaced with small positive values while each sample's abundances are rescaled so that they still sum to 1.
Parameters
Returns
skbio.stats.distance.DistanceMatrix, a distance matrix.
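The three steps described above (zero replacement, CLR transform, Euclidean distance) can be sketched with NumPy alone. This is an illustrative sketch, not the code SampleCollection runs: the function names and the `delta` replacement value are assumptions made for the example.

```python
import numpy as np


def multiplicative_replacement(counts, delta):
    """Replace zeros with `delta` and shrink non-zeros so each row still sums to 1."""
    x = np.asarray(counts, dtype=float)
    x = x / x.sum(axis=1, keepdims=True)  # convert counts to proportions
    zeros = x == 0
    n_zeros = zeros.sum(axis=1, keepdims=True)
    return np.where(zeros, delta, x * (1 - n_zeros * delta))


def clr(x):
    """Centered log-ratio transform: log values minus the per-row mean log value."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)


def aitchison_distance(counts, delta=1e-6):
    """Pairwise Euclidean distance between CLR-transformed rows (samples)."""
    z = clr(multiplicative_replacement(counts, delta))
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical abundance table: rows are samples, columns are taxa.
counts = np.array([[10, 0, 90], [20, 30, 50], [0, 0, 100]])
dist = aitchison_distance(counts)
```

Because the data are first converted to proportions, scaling a sample's counts by a constant does not change its distance to other samples.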
alpha_diversity
- SampleCollection.alpha_diversity(metric: Metric = auto, rank=auto, diversity_metric: AlphaDiversityMetric = shannon) -> pd.DataFrame
Calculate the diversity within a community.
Parameters
- rank : Rank, optional
Analysis will be restricted to abundances of taxa at the specified level. See Rank for details.
- metric : Metric, optional
The taxonomic abundance metric to use. See Metric for definitions.
- diversity_metric : AlphaDiversityMetric
Function to use when calculating the diversity within each sample.
Returns
pandas.DataFrame, the alpha diversity value for each sample.
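To make the default diversity_metric concrete, the Shannon index of one sample can be sketched in plain Python as H = -sum(p_i * log(p_i)) over the taxa with non-zero abundance. The function name and the choice of the natural logarithm as the default base are assumptions for illustration; the base used by the library may differ.

```python
import math


def shannon(counts, base=math.e):
    """Shannon diversity of a single sample from a list of taxon abundances."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no observed abundance")
    # Taxa with zero abundance contribute nothing to the sum.
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


even_sample = shannon([5, 5, 5, 5])     # four equally abundant taxa
single_taxon = shannon([10, 0, 0])      # one taxon dominates completely
```

An evenly distributed sample reaches the maximum value log(number of taxa), while a sample containing a single taxon has a diversity of zero.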
alpha_diversity_stats
- SampleCollection.alpha_diversity_stats(*, group_by: str | tuple[str, ...] | list[str], paired_by: str | tuple[str, ...] | list[str] | None = None, metric: Metric = auto, test: AlphaDiversityStatsTest = auto, diversity_metric: AlphaDiversityMetric = shannon, rank: Rank = auto, alpha: float = 0.05, require_classification_version_match: bool = True) -> AlphaDiversityStatsResults
Perform a test for significant differences between groups of alpha diversity values.
The following tests are supported:
Wilcoxon (2 groups, paired data)
Mann-Whitney U (2 groups, unpaired data)
Kruskal-Wallis with optional posthoc Dunn test (>=2 groups, unpaired data)
Parameters
- group_by : str or tuple of str or list of str
Metadata variable to group samples by. At least two groups are required. If group_by is a tuple or list, field values are joined with an underscore character ("_").
- paired_by : str or tuple of str or list of str, optional
Metadata variable to pair samples in each group. May only be used with test="wilcoxon". If paired_by is a tuple or list, field values are joined with an underscore character ("_").
- test : AlphaDiversityStatsTest, optional
Stats test to perform. If "auto", "mannwhitneyu" will be chosen if there are two groups of unpaired data. "wilcoxon" will be chosen if there are two groups and paired_by is specified. "kruskal" will be chosen if there are more than two groups.
- rank : Rank, optional
Analysis will be restricted to abundances of taxa at the specified level. See Rank for details.
- metric : Metric, optional
The taxonomic abundance metric to use. See Metric for definitions.
- diversity_metric : AlphaDiversityMetric
Function to use when calculating the diversity within each sample.
- alpha : float, optional
Threshold to determine statistical significance when test="kruskal" (e.g. p < alpha). Must be between 0 and 1 (exclusive). If the Kruskal-Wallis p-value is significant and there are more than two groups, a posthoc Dunn test is performed.
- require_classification_version_match : bool, optional
If True, require the same primary classification job ID across all samples included in the test.
Returns
AlphaDiversityStatsResults, the results of the test.
See Also
scipy.stats.wilcoxon, scipy.stats.mannwhitneyu, scipy.stats.kruskal, scikit_posthocs.posthoc_dunn
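The selection rules described for test="auto" and the posthoc condition described for alpha can be written out as a small decision function. This is a sketch of the documented behavior only; `choose_test` and `should_run_posthoc` are hypothetical names, and raising an error for paired data with more than two groups is an assumption based on the statement that paired_by may only be used with the Wilcoxon test.

```python
def choose_test(num_groups, paired):
    """Return the test name that the 'auto' rules described above would select."""
    if num_groups < 2:
        raise ValueError("at least two groups are required")
    if paired:
        if num_groups != 2:
            # Assumption: pairing is only meaningful for the two-group Wilcoxon test.
            raise ValueError("paired data requires exactly two groups")
        return "wilcoxon"
    return "mannwhitneyu" if num_groups == 2 else "kruskal"


def should_run_posthoc(test, pvalue, num_groups, alpha=0.05):
    """A posthoc Dunn test follows a significant Kruskal-Wallis result with >2 groups."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1 (exclusive)")
    return test == "kruskal" and num_groups > 2 and pvalue < alpha


selected = choose_test(num_groups=3, paired=False)
run_posthoc = should_run_posthoc(selected, pvalue=0.01, num_groups=3)
```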
beta_diversity
- SampleCollection.beta_diversity(metric: Metric = auto, rank: Rank = auto, diversity_metric: BetaDiversityMetric = braycurtis) -> skbio.stats.distance.DistanceMatrix
Calculate the diversity between two communities.
Parameters
- rank : Rank, optional
Analysis will be restricted to abundances of taxa at the specified level. See Rank for details.
- metric : Metric, optional
The taxonomic abundance metric to use. See Metric for definitions.
- diversity_metric : BetaDiversityMetric
Function to use when calculating the distance between two samples.
Returns
skbio.stats.distance.DistanceMatrix, a distance matrix.
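As an illustration of the default diversity_metric, the Bray-Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the total abundance of both samples. The sketch below uses plain Python; the function names and the example abundances are hypothetical, and the library itself returns an skbio DistanceMatrix rather than nested lists.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("samples must cover the same taxa")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def pairwise_bray_curtis(samples):
    """Square matrix of Bray-Curtis dissimilarities between all pairs of samples."""
    return [[bray_curtis(u, v) for v in samples] for u in samples]


# Hypothetical abundance table: one list per sample, one entry per taxon.
samples = [[6, 4, 0], [4, 6, 0], [0, 0, 10]]
matrix = pairwise_bray_curtis(samples)
```

The value ranges from 0 for identical samples to 1 for samples that share no taxa.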
beta_diversity_stats
- SampleCollection.beta_diversity_stats(*, group_by: str | tuple[str, ...] | list[str], metric: Metric = auto, diversity_metric: BetaDiversityMetric = braycurtis, rank: Rank = auto, alpha: float = 0.05, num_permutations: int = 999, require_classification_version_match: bool = True) -> BetaDiversityStatsResults
Test for significant differences between groups of samples based on their distances.
Beta diversity distances between samples are computed and a PERMANOVA test is performed to assess whether there are significant differences between groups of samples. Posthoc pairwise PERMANOVA tests are performed if the global test is found to be statistically significant and there are more than two groups.
Parameters
- group_by : str or tuple of str or list of str
Metadata variable to group samples by. At least two groups are required. If group_by is a tuple or list, field values are joined with an underscore character ("_").
- metric : Metric, optional
The taxonomic abundance metric to use. See Metric for definitions.
- diversity_metric : BetaDiversityMetric
Function to use when calculating the distance between two samples.
- rank : Rank, optional
Analysis will be restricted to abundances of taxa at the specified level. See Rank for details.
- alpha : float, optional
Threshold to determine statistical significance (e.g. p < alpha). Must be between 0 and 1 (exclusive). If the p-value is significant and there are more than two groups, posthoc pairwise PERMANOVA tests are performed.
- num_permutations : int, optional
Number of permutations to use when computing the p-value.
- require_classification_version_match : bool, optional
If True, require the same primary classification job ID across all samples included in the test.
Returns
BetaDiversityStatsResults, the results of the test.
See Also
skbio.stats.distance.permanova, scipy.stats.false_discovery_control
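To show what the PERMANOVA pseudo-F statistic and the role of num_permutations amount to, here is a minimal NumPy sketch of a one-way PERMANOVA on a precomputed distance matrix. It is not the skbio implementation referenced above: the function names, the `seed` argument, and the toy data are assumptions, and the p-value uses the common (count + 1) / (num_permutations + 1) convention.

```python
import numpy as np


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and one group label per sample."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    upper = np.triu_indices(n, k=1)
    ss_total = (dist[upper] ** 2).sum() / n
    ss_within = 0.0
    for group in groups:
        idx = np.flatnonzero(labels == group)
        sub = dist[np.ix_(idx, idx)]
        ss_within += (np.triu(sub, k=1) ** 2).sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova_sketch(dist, labels, num_permutations=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(dist, labels)
    # Shuffle group labels to build the null distribution of the statistic.
    f_perm = np.array(
        [pseudo_f(dist, rng.permutation(labels)) for _ in range(num_permutations)]
    )
    pvalue = ((f_perm >= f_obs).sum() + 1) / (num_permutations + 1)
    return f_obs, pvalue


# Toy data: two well-separated groups of points on a line.
coords = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3])
toy_dist = np.abs(coords[:, None] - coords[None, :])
toy_labels = ["a"] * 4 + ["b"] * 4
f_stat, p = permanova_sketch(toy_dist, toy_labels)
```

Because the p-value comes from a finite number of permutations, its smallest possible value is 1 / (num_permutations + 1).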
unifrac
- SampleCollection.unifrac(metric: Metric = auto, rank: Rank = auto, weighted: bool = True)
Calculate the UniFrac beta diversity metric.
UniFrac takes into account the relatedness of community members. Weighted UniFrac considers abundances; unweighted UniFrac considers only presence or absence.
Parameters
Returns
skbio.stats.distance.DistanceMatrix, a distance matrix.
Results
AlphaDiversityStatsResults
- class onecodex.stats.AlphaDiversityStatsResults(*, statistic: float, pvalue: float, alpha: float, sample_size: int, group_by_variable: str, group_sizes: dict[str, int], posthoc: PosthocResults | None = None, test: AlphaDiversityStatsTest, paired_by_variable: str | None = None)
A dataclass for storing the results of an alpha diversity stats test.
test: stats test that was performed
statistic: computed test statistic (e.g. U statistic if test="mannwhitneyu")
pvalue: computed p-value
alpha: p-value threshold used to determine whether to run a posthoc test when test="kruskal"
sample_size: number of samples used in the test after filtering
group_by_variable: name of the variable used to group samples by
group_sizes: dict mapping group name to sample size in each group
paired_by_variable: name of the variable used to pair samples by (if the data were paired)
posthoc: PosthocResults
BetaDiversityStatsResults
- class onecodex.stats.BetaDiversityStatsResults(*, statistic: float, pvalue: float, alpha: float, sample_size: int, group_by_variable: str, group_sizes: dict[str, int], posthoc: PosthocResults | None = None, test: BetaDiversityStatsTest, num_permutations: int)
A dataclass for storing the results of a beta diversity test.
test: stats test that was performed
statistic: PERMANOVA pseudo-F test statistic
pvalue: p-value based on num_permutations
alpha: p-value threshold used to determine whether to run a posthoc test
num_permutations: number of permutations used to compute pvalue
sample_size: number of samples used in the test after filtering
group_by_variable: name of the variable used to group samples by
group_sizes: dict mapping group name to sample size in each group
posthoc: PosthocResults
PosthocResults
- class onecodex.stats.PosthocResults(*, test: PosthocStatsTest, adjustment_method: AdjustmentMethod, adjusted_pvalues: pd.DataFrame, pvalues: pd.DataFrame | None = None, statistics: pd.DataFrame | None = None)
A dataclass for storing results from a posthoc statistical test.
test: posthoc stats test that was performed
adjustment_method: method used to adjust p-values to control the false discovery rate
statistics: pd.DataFrame containing pairwise test statistics. The index and columns are sorted group names.
pvalues: pd.DataFrame containing unadjusted p-values. The index and columns are sorted group names.
adjusted_pvalues: pd.DataFrame containing adjusted p-values. p-values are adjusted for false discovery rate using adjustment_method. The index and columns are sorted group names.
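To illustrate how pvalues relate to adjusted_pvalues, the sketch below applies the Benjamini-Hochberg procedure to a flat list of pairwise p-values. The specific adjustment method is an assumption, since this section only says that adjustment_method controls the false discovery rate; the function name and the example values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as the corresponding raw p-value, so a comparison can lose significance after adjustment but cannot gain it.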