pyranges.statistics
Statistics useful for genomics.
Module Contents
Classes
StatisticsMethods | Namespace for statistical comparison operations.
Functions
fdr | Adjust p-values with Benjamini-Hochberg.
fisher_exact | Fisher's exact for contingency tables.
mcc | Compute Matthews correlation coefficient for PyRanges overlaps.
rowbased_spearman | Fast row-based Spearman's correlation.
rowbased_pearson | Fast row-based Pearson's correlation.
rowbased_rankdata | Rank order of entries in each row.
simes | Apply Simes method for giving dependent events a p-value.
- pyranges.statistics.fdr(p_vals)
Adjust p-values with Benjamini-Hochberg.
- Parameters:
p_vals (array-like) – P-values to adjust.
- Returns:
Adjusted p-values, one per input value.
- Return type:
array-like
Examples
>>> d = {'Chromosome': ['chr3', 'chr6', 'chr13'], 'Start': [146419383, 39800100, 24537618], 'End': [146419483, 39800200, 24537718], 'Strand': ['-', '+', '-'], 'PValue': [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]}
>>> gr = pr.from_dict(d)
>>> gr
+--------------+-----------+-----------+--------------+-------------+
| Chromosome   |     Start |       End | Strand       |      PValue |
| (category)   |   (int64) |   (int64) | (category)   |   (float64) |
|--------------+-----------+-----------+--------------+-------------|
| chr3         | 146419383 | 146419483 | -            |  0.00395914 |
| chr6         |  39800100 |  39800200 | +            |  0.00376005 |
| chr13        |  24537618 |  24537718 | -            |  0.00750612 |
+--------------+-----------+-----------+--------------+-------------+
Stranded PyRanges object has 3 rows and 5 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr.FDR = pr.stats.fdr(gr.PValue)
>>> gr.print(formatting={"PValue": "{:.4f}", "FDR": "{:.4}"})
+--------------+-----------+-----------+--------------+-------------+-------------+
| Chromosome   |     Start |       End | Strand       |      PValue |         FDR |
| (category)   |   (int64) |   (int64) | (category)   |   (float64) |   (float64) |
|--------------+-----------+-----------+--------------+-------------+-------------|
| chr3         | 146419383 | 146419483 | -            |       0.004 |    0.005939 |
| chr6         |  39800100 |  39800200 | +            |      0.0038 |     0.01128 |
| chr13        |  24537618 |  24537718 | -            |      0.0075 |    0.007506 |
+--------------+-----------+-----------+--------------+-------------+-------------+
Stranded PyRanges object has 3 rows and 6 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
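The FDR column in the example above is consistent with scaling each p-value by n / rank, where n is the number of tests and rank 1 is the smallest p-value. The following is a minimal sketch of that arithmetic, not the library's code: `fdr_sketch` is a hypothetical helper, and capping the result at 1 is an assumption. It also omits the monotonicity step that some Benjamini-Hochberg implementations add.

```python
import pandas as pd

def fdr_sketch(p_vals):
    """Scale each p-value by n / rank and cap at 1 (hypothetical helper)."""
    p = pd.Series(p_vals, dtype=float)
    ranks = p.rank(method="average")  # rank 1 = smallest p-value, ties share a rank
    # Assumption: values above 1 are clipped to 1.
    return (p * len(p) / ranks).clip(upper=1.0)

# Same p-values as the PyRanges example (chr3, chr6, chr13).
p_values = [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]
adjusted = fdr_sketch(p_values)
print(adjusted.round(6).tolist())
```

For these three inputs the sketch gives the same values as the FDR column when rounded to four significant digits.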
- pyranges.statistics.fisher_exact(tp, fp, fn, tn, pseudocount=0)
Fisher’s exact for contingency tables.
Computes the two-sided, less and greater hypotheses at the same time.
- Parameters:
tp (array-like of int) – Top left square of contingency table (true positives).
fp (array-like of int) – Top right square of contingency table (false positives).
fn (array-like of int) – Bottom left square of contingency table (false negatives).
tn (array-like of int) – Bottom right square of contingency table (true negatives).
pseudocount (float, default 0) – Values > 0 allow Odds Ratio to always be a finite number.
Notes
The odds ratio is computed as:
((tp + pseudocount) / (fp + pseudocount)) / ((fn + pseudocount) / (tn + pseudocount))
- Returns:
DataFrame with columns OR, P, PLeft and PRight.
- Return type:
pandas.DataFrame
See also
pr.stats.fdr
correct for multiple testing
Examples
>>> d = {"TP": [12, 0], "FP": [5, 12], "TN": [29, 10], "FN": [2, 2]}
>>> df = pd.DataFrame(d)
>>> df
   TP  FP  TN  FN
0  12   5  29   2
1   0  12  10   2
>>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN)
         OR         P     PLeft    PRight
0  0.165517  0.080269  0.044555  0.994525
1  0.000000  0.000067  0.000034  1.000000
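The odds-ratio formula from the Notes can be checked by hand. This sketch covers only the odds ratio, not the p-values. `odds_ratio` is a hypothetical helper, and the second call uses made-up counts to show why a pseudocount helps when a cell is zero.

```python
def odds_ratio(tp, fp, fn, tn, pseudocount=0):
    """Odds ratio of a 2x2 table, with an optional pseudocount added to every cell."""
    return ((tp + pseudocount) / (fp + pseudocount)) / (
        (fn + pseudocount) / (tn + pseudocount)
    )

# First row of the example, in the positional order passed above: 12, 5, 29, 2.
first_row = odds_ratio(12, 5, 29, 2)

# Hypothetical table with fp == 0: without a pseudocount this would divide by zero.
smoothed = odds_ratio(3, 0, 1, 4, pseudocount=0.5)

print(round(first_row, 6), smoothed)
```

With `pseudocount=0` and `fp == 0` the plain-Python version raises `ZeroDivisionError`, which is the situation the `pseudocount` parameter is meant to avoid.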
- pyranges.statistics.mcc(grs, genome=None, labels=None, strand=False, verbose=False)
Compute Matthews correlation coefficient for PyRanges overlaps.
- Parameters:
grs (list of PyRanges) – PyRanges to compare.
genome (DataFrame or dict, default None) – Should contain chromosome sizes. By default, the end position of the rightmost interval on each chromosome is used as a proxy for its size, but supplying a genome is recommended.
labels (list of str, default None) – Names to give the PyRanges in the output.
strand (bool, default False) – Whether to compute correlations per strand.
verbose (bool, default False) – Warn if some chromosomes are in the genome, but not in the PyRanges.
Examples
>>> grs = [pr.data.aorta(), pr.data.aorta(), pr.data.aorta2()]
>>> mcc = pr.stats.mcc(grs, labels="abc", genome={"chr1": 2100000})
>>> mcc
   T  F   TP   FP       TN   FN      MCC
0  a  a  728    0  2099272    0  1.00000
1  a  b  728    0  2099272    0  1.00000
3  a  c  457  485  2098787  271  0.55168
2  b  a  728    0  2099272    0  1.00000
5  b  b  728    0  2099272    0  1.00000
6  b  c  457  485  2098787  271  0.55168
4  c  a  457  271  2098787  485  0.55168
7  c  b  457  271  2098787  485  0.55168
8  c  c  942    0  2099058    0  1.00000
To create a symmetric matrix (useful for heatmaps of correlations):
>>> mcc.set_index(["T", "F"]).MCC.unstack().rename_axis(None, axis=0)
F        a        b        c
a  1.00000  1.00000  0.55168
b  1.00000  1.00000  0.55168
c  0.55168  0.55168  1.00000
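The MCC column can be recomputed from the TP, FP, TN and FN counts with the standard Matthews correlation formula. The sketch below is illustrative only. `matthews_cc` is a hypothetical helper, and returning 0.0 for a zero denominator is an assumption rather than documented behaviour.

```python
import math

def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four cells of a confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat an undefined ratio as 0.0.
    return numerator / denominator if denominator else 0.0

# Counts taken from the "a vs c" and "a vs a" rows of the table above.
a_vs_c = matthews_cc(tp=457, fp=485, tn=2098787, fn=271)
a_vs_a = matthews_cc(tp=728, fp=0, tn=2099272, fn=0)

print(round(a_vs_c, 5), round(a_vs_a, 5))
```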
- pyranges.statistics.rowbased_spearman(x, y)
Fast row-based Spearman’s correlation.
- Parameters:
x (matrix-like) – 2D numerical matrix. Same size as y.
y (matrix-like) – 2D numerical matrix. Same size as x.
- Returns:
Array with one entry per row of the input, where values are Spearman correlation coefficients.
- Return type:
numpy.array
See also
pyranges.statistics.rowbased_pearson
fast row-based Pearson’s correlation.
pr.stats.fdr
correct for multiple testing
Examples
>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
>>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])
Perform Spearman's correlation pairwise on each row of the two 3x3 matrices:
>>> pr.stats.rowbased_spearman(x, y)
array([-0.5,  0.5, -1. ])
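Spearman's correlation is Pearson's correlation computed on ranks, so the result above can be cross-checked with plain NumPy. This is a sketch under the assumption that no row contains ties (true for this data). `row_ranks` and `row_pearson` are hypothetical helpers, not part of pyranges.

```python
import numpy as np

def row_ranks(m):
    """Rank the entries of each row, 1 = smallest (assumes no ties within a row)."""
    return np.argsort(np.argsort(m, axis=1), axis=1) + 1.0

def row_pearson(a, b):
    """Pearson correlation between matching rows of two equally shaped matrices."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))

x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

# Spearman = Pearson on the per-row ranks.
rho = row_pearson(row_ranks(x), row_ranks(y))
print(rho)
```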
- pyranges.statistics.rowbased_pearson(x, y)
Fast row-based Pearson’s correlation.
- Parameters:
x (matrix-like) – 2D numerical matrix. Same size as y.
y (matrix-like) – 2D numerical matrix. Same size as x.
- Returns:
Array with one entry per row of the input, where values are Pearson correlation coefficients.
- Return type:
numpy.array
See also
pyranges.statistics.rowbased_spearman
fast row-based Spearman’s correlation.
pr.stats.fdr
correct for multiple testing
Examples
>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
>>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])
Perform Pearson's correlation pairwise on each row of the two 3x3 matrices:
>>> pr.stats.rowbased_pearson(x, y)
array([-0.09078413,  0.65465367, -1.        ])
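For reference, the row-wise computation can be written in a few lines of NumPy: centre each row, then divide the row-wise sum of products by the product of the row norms. This is an illustrative sketch, not the library's implementation, and `row_pearson_sketch` is a hypothetical name.

```python
import numpy as np

def row_pearson_sketch(x, y):
    """Pearson correlation between row i of x and row i of y, for every i."""
    xc = x - x.mean(axis=1, keepdims=True)  # centre each row of x
    yc = y - y.mean(axis=1, keepdims=True)  # centre each row of y
    covariance = (xc * yc).sum(axis=1)
    norms = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return covariance / norms

x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

r = row_pearson_sketch(x, y)
print(r)
```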
- pyranges.statistics.rowbased_rankdata(data)
Rank order of entries in each row.
Same as SciPy rankdata with method="average", applied to each row.
- Parameters:
data (matrix-like) – The data to find order of.
- Returns:
DataFrame where values are order of data.
- Return type:
pandas.DataFrame
Examples
>>> x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]])
>>> pr.stats.rowbased_rankdata(x)
     0    1    2    3
0  2.0  4.0  3.0  1.0
1  1.0  2.0  3.0  4.0
2  2.5  4.0  1.0  2.5
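The same ranks can be obtained with pandas alone, which is a convenient cross-check. This sketch uses `DataFrame.rank` with `axis=1` and average ranks for ties. It shows equivalent output, not how pyranges computes it.

```python
import numpy as np
import pandas as pd

x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]])

# Rank within each row; tied values (the two 5s in the last row) share the mean rank.
ranks = pd.DataFrame(x).rank(axis=1, method="average")
print(ranks)
```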
- pyranges.statistics.simes(df, groupby, pcol, keep_position=False)
Apply Simes method for giving dependent events a p-value.
- Parameters:
df (pandas.DataFrame) – Data to analyse with Simes.
groupby (str or list of str) – Features equal in these columns will be merged with Simes.
pcol (str) – Name of column with p-values.
keep_position (bool, default False) – Keep columns “Chromosome”, “Start”, “End” and “Strand” if they exist.
See also
pr.stats.fdr
correct for multiple testing
Examples
>>> s = '''Chromosome Start End Strand Gene PValue
... 1 10 20 + P53 0.0001
... 1 20 20 + P53 0.0002
... 1 30 20 + P53 0.0003
... 2 60 65 - FOX 0.05
... 2 70 75 - FOX 0.0000001
... 2 80 90 - FOX 0.0000021'''
>>> gr = pr.from_string(s)
>>> gr
+--------------+-----------+-----------+--------------+------------+-------------+
| Chromosome   |     Start |       End | Strand       | Gene       |      PValue |
| (category)   |   (int64) |   (int64) | (category)   | (object)   |   (float64) |
|--------------+-----------+-----------+--------------+------------+-------------|
| 1            |        10 |        20 | +            | P53        |      0.0001 |
| 1            |        20 |        20 | +            | P53        |      0.0002 |
| 1            |        30 |        20 | +            | P53        |      0.0003 |
| 2            |        60 |        65 | -            | FOX        |        0.05 |
| 2            |        70 |        75 | -            | FOX        |       1e-07 |
| 2            |        80 |        90 | -            | FOX        |     2.1e-06 |
+--------------+-----------+-----------+--------------+------------+-------------+
Stranded PyRanges object has 6 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> simes = pr.stats.simes(gr.df, "Gene", "PValue")
>>> simes
  Gene         Simes
0  FOX  3.000000e-07
1  P53  3.000000e-04
>>> gr.apply(lambda df:
...   pr.stats.simes(df, "Gene", "PValue", keep_position=True))
+--------------+-----------+-----------+-------------+--------------+------------+
| Chromosome   |     Start |       End |       Simes | Strand       | Gene       |
| (category)   |   (int64) |   (int64) |   (float64) | (category)   | (object)   |
|--------------+-----------+-----------+-------------+--------------+------------|
| 1            |        10 |        20 |      0.0001 | +            | P53        |
| 2            |        60 |        90 |       1e-07 | -            | FOX        |
+--------------+-----------+-----------+-------------+--------------+------------+
Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
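Simes' method combines the n p-values of a group by sorting them and taking the minimum of n * p_(i) / i over the sorted positions i. The sketch below applies that definition per gene with a pandas groupby. `simes_p` is a hypothetical helper, and the code illustrates the formula rather than the library's implementation.

```python
import numpy as np
import pandas as pd

def simes_p(p_values):
    """Simes combined p-value: min over i of n * p_(i) / i for sorted p-values."""
    p = np.sort(np.asarray(p_values, dtype=float))
    n = len(p)
    return float((p * n / np.arange(1, n + 1)).min())

# Gene and PValue columns from the example above.
df = pd.DataFrame({
    "Gene": ["P53", "P53", "P53", "FOX", "FOX", "FOX"],
    "PValue": [0.0001, 0.0002, 0.0003, 0.05, 0.0000001, 0.0000021],
})

combined = df.groupby("Gene")["PValue"].apply(simes_p)
print(combined)
```

On this data the sketch yields the same FOX and P53 values as the `pr.stats.simes(gr.df, "Gene", "PValue")` call.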
- class pyranges.statistics.StatisticsMethods(pr)
Namespace for statistical comparison operations.
Accessed with gr.stats.
- pr
- forbes(other, chromsizes, strandedness=None)
Compute the Forbes coefficient.
Ratio which represents observed versus expected co-occurrence.
Described in
Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.
- Parameters:
other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compare intervals regardless of strand, on the same strand, or on the opposite strand.
- Returns:
Ratio of observed versus expected co-occurrence.
- Return type:
float
See also
pyranges.statistics.jaccard
compute the jaccard coefficient
Examples
>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.forbes(gr2, chromsizes=chromsizes)
1.7168314674978278
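As a toy illustration of "observed versus expected", the Forbes coefficient is commonly written as N * |A ∩ B| / (|A| * |B|), where N is the genome length and the sizes are counted in bases. The sketch below models covered bases as Python sets on a made-up 1000-base genome. It ignores chromosomes, strands and interval merging, and `forbes_coefficient` is a hypothetical helper.

```python
def forbes_coefficient(a, b, genome_length):
    """Observed overlap divided by the overlap expected if a and b were independent."""
    overlap = len(a & b)
    return genome_length * overlap / (len(a) * len(b))

# Hypothetical data: two 100-base intervals sharing 50 bases.
a = set(range(0, 100))
b = set(range(50, 150))

score = forbes_coefficient(a, b, genome_length=1000)
print(score)
```

A value above 1 means the two sets overlap more than expected by chance, and a value below 1 means they overlap less.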
- jaccard(other, **kwargs)
Compute the Jaccard coefficient.
Ratio of the intersection and union of two sets.
- Parameters:
other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compare intervals regardless of strand, on the same strand, or on the opposite strand.
- Returns:
Ratio of the intersection and union of two sets.
- Return type:
float
See also
pyranges.statistics.forbes
compute the forbes coefficient
Examples
>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.jaccard(gr2, chromsizes=chromsizes)
6.657941988519211e-05
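The ratio itself is easy to illustrate on toy data. This sketch treats covered bases as Python sets and ignores chromosomes and strands. `jaccard_coefficient` is a hypothetical helper, not the pyranges implementation.

```python
def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical data: two 100-base intervals sharing 50 bases (union of 150 bases).
a = set(range(0, 100))
b = set(range(50, 150))

score = jaccard_coefficient(a, b)
print(score)
```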
- relative_distance(other, **kwargs)
Compute spatial correlation between two sets.
Metric which describes the relative distance between each interval in one set and the two closest intervals in another.
- Parameters:
other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compare intervals regardless of strand, on the same strand, or on the opposite strand.
- Returns:
DataFrame containing the frequency of each relative distance.
- Return type:
pandas.DataFrame
See also
pyranges.statistics.jaccard
compute the jaccard coefficient
pyranges.statistics.forbes
compute the forbes coefficient
Examples
>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.relative_distance(gr2)
    reldist  count  total  fraction
0      0.00    264   9956  0.026517
1      0.01    226   9956  0.022700
2      0.02    206   9956  0.020691
3      0.03    235   9956  0.023604
4      0.04    194   9956  0.019486
5      0.05    241   9956  0.024207
6      0.06    201   9956  0.020189
7      0.07    191   9956  0.019184
8      0.08    192   9956  0.019285
9      0.09    191   9956  0.019184
10     0.10    186   9956  0.018682
11     0.11    203   9956  0.020390
12     0.12    218   9956  0.021896
13     0.13    209   9956  0.020992
14     0.14    201   9956  0.020189
15     0.15    178   9956  0.017879
16     0.16    202   9956  0.020289
17     0.17    197   9956  0.019787
18     0.18    208   9956  0.020892
19     0.19    202   9956  0.020289
20     0.20    191   9956  0.019184
21     0.21    188   9956  0.018883
22     0.22    213   9956  0.021394
23     0.23    192   9956  0.019285
24     0.24    199   9956  0.019988
25     0.25    181   9956  0.018180
26     0.26    172   9956  0.017276
27     0.27    191   9956  0.019184
28     0.28    190   9956  0.019084
29     0.29    192   9956  0.019285
30     0.30    201   9956  0.020189
31     0.31    212   9956  0.021294
32     0.32    213   9956  0.021394
33     0.33    177   9956  0.017778
34     0.34    197   9956  0.019787
35     0.35    163   9956  0.016372
36     0.36    191   9956  0.019184
37     0.37    198   9956  0.019888
38     0.38    160   9956  0.016071
39     0.39    188   9956  0.018883
40     0.40    200   9956  0.020088
41     0.41    188   9956  0.018883
42     0.42    230   9956  0.023102
43     0.43    197   9956  0.019787
44     0.44    224   9956  0.022499
45     0.45    184   9956  0.018481
46     0.46    198   9956  0.019888
47     0.47    187   9956  0.018783
48     0.48    200   9956  0.020088
49     0.49    194   9956  0.019486
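To make the metric concrete, here is a toy sketch of one common definition of relative distance: for a query position lying between two neighbouring reference positions, it is the distance to the nearer neighbour divided by the distance between the two neighbours, giving a value from 0 to 0.5. Which coordinate pyranges uses for each interval and how it bins the values are not stated above, so the single-point model and the helper `relative_distance_of` are assumptions made for illustration.

```python
from bisect import bisect_right

def relative_distance_of(point, sorted_refs):
    """Distance to the nearer flanking reference, scaled by the gap between both.

    Returns None when the point is not flanked by two reference positions.
    """
    i = bisect_right(sorted_refs, point)
    if i == 0 or i == len(sorted_refs):
        return None
    left, right = sorted_refs[i - 1], sorted_refs[i]
    return min(point - left, right - point) / (right - left)

# Hypothetical reference and query positions on one chromosome.
refs = [0, 100, 300]
queries = [25, 50, 100, 290, 400]

distances = [relative_distance_of(q, refs) for q in queries]
print(distances)
```

If two sets are spatially unrelated, the relative distances are expected to be spread roughly evenly between 0 and 0.5, while an excess of small values suggests the intervals tend to lie close together.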