pyranges.statistics

Statistics useful for genomics.

Module Contents

Classes

StatisticsMethods

Namespace for statistical comparsion-operations.

Functions

fdr(p_vals)

Adjust p-values with Benjamini-Hochberg.

fisher_exact(tp, fp, fn, tn[, pseudocount])

Fisher's exact for contingency tables.

mcc(grs[, genome, labels, strand, verbose])

Compute Matthew's correlation coefficient for PyRanges overlaps.

rowbased_spearman(x, y)

Fast row-based Spearman's correlation.

rowbased_pearson(x, y)

Fast row-based Pearson's correlation.

rowbased_rankdata(data)

Rank order of entries in each row.

simes(df, groupby, pcol[, keep_position])

Apply Simes method for giving dependent events a p-value.

pyranges.statistics.fdr(p_vals)

Adjust p-values with Benjamini-Hochberg.

Parameters:

data (array-like) –

Returns:

DataFrame where values are order of data.

Return type:

Pandas.DataFrame

Examples

>>> d = {'Chromosome': ['chr3', 'chr6', 'chr13'], 'Start': [146419383, 39800100, 24537618], 'End': [146419483, 39800200, 24537718], 'Strand': ['-', '+', '-'], 'PValue': [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]}
>>> gr = pr.from_dict(d)
>>> gr
+--------------+-----------+-----------+--------------+-------------+
| Chromosome   |     Start |       End | Strand       |      PValue |
| (category)   |   (int64) |   (int64) | (category)   |   (float64) |
|--------------+-----------+-----------+--------------+-------------|
| chr3         | 146419383 | 146419483 | -            |  0.00395914 |
| chr6         |  39800100 |  39800200 | +            |  0.00376005 |
| chr13        |  24537618 |  24537718 | -            |  0.00750612 |
+--------------+-----------+-----------+--------------+-------------+
Stranded PyRanges object has 3 rows and 5 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr.FDR = pr.stats.fdr(gr.PValue)
>>> gr.print(formatting={"PValue": "{:.4f}", "FDR": "{:.4}"})
+--------------+-----------+-----------+--------------+-------------+-------------+
| Chromosome   |     Start |       End | Strand       |      PValue |         FDR |
| (category)   |   (int64) |   (int64) | (category)   |   (float64) |   (float64) |
|--------------+-----------+-----------+--------------+-------------+-------------|
| chr3         | 146419383 | 146419483 | -            |      0.004  |    0.005939 |
| chr6         |  39800100 |  39800200 | +            |      0.0038 |    0.01128  |
| chr13        |  24537618 |  24537718 | -            |      0.0075 |    0.007506 |
+--------------+-----------+-----------+--------------+-------------+-------------+
Stranded PyRanges object has 3 rows and 6 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
pyranges.statistics.fisher_exact(tp, fp, fn, tn, pseudocount=0)

Fisher’s exact for contingency tables.

Computes the hypotheses two-sided, less and greater at the same time.

The odds-ratio is

Parameters:
  • tp (array-like of int) – Top left square of contingency table (true positives).

  • fp (array-like of int) – Top right square of contingency table (false positives).

  • fn (array-like of int) – Bottom left square of contingency table (false negatives).

  • tn (array-like of int) – Bottom right square of contingency table (true negatives).

  • pseudocount (float, default 0) – Values > 0 allow Odds Ratio to always be a finite number.

Notes

The odds-ratio is computed thusly:

((tp + pseudocount) / (fp + pseudocount)) / ((fn + pseudocount) / (tn + pseudocount))

Returns:

DataFrame with columns OR and P, PLeft and PRight.

Return type:

pandas.DataFrame

See also

pr.stats.fdr

correct for multiple testing

Examples

>>> d = {"TP": [12, 0], "FP": [5, 12], "TN": [29, 10], "FN": [2, 2]}
>>> df = pd.DataFrame(d)
>>> df
   TP  FP  TN  FN
0  12   5  29   2
1   0  12  10   2
>>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN)
         OR         P     PLeft    PRight
0  0.165517  0.080269  0.044555  0.994525
1  0.000000  0.000067  0.000034  1.000000
pyranges.statistics.mcc(grs, genome=None, labels=None, strand=False, verbose=False)

Compute Matthew’s correlation coefficient for PyRanges overlaps.

Parameters:
  • grs (list of PyRanges) – PyRanges to compare.

  • genome (DataFrame or dict, default None) – Should contain chromosome sizes. By default, end position of the rightmost intervals are used as proxies for the chromosome size, but it is recommended to use a genome.

  • labels (list of str, default None) – Names to give the PyRanges in the output.

  • strand (bool, default False) – Whether to compute correlations per strand.

  • verbose (bool, default False) – Warn if some chromosomes are in the genome, but not in the PyRanges.

Examples

>>> grs = [pr.data.aorta(), pr.data.aorta(), pr.data.aorta2()]
>>> mcc = pr.stats.mcc(grs, labels="abc", genome={"chr1": 2100000})
>>> mcc
   T  F   TP   FP       TN   FN      MCC
0  a  a  728    0  2099272    0  1.00000
1  a  b  728    0  2099272    0  1.00000
3  a  c  457  485  2098787  271  0.55168
2  b  a  728    0  2099272    0  1.00000
5  b  b  728    0  2099272    0  1.00000
6  b  c  457  485  2098787  271  0.55168
4  c  a  457  271  2098787  485  0.55168
7  c  b  457  271  2098787  485  0.55168
8  c  c  942    0  2099058    0  1.00000

To create a symmetric matrix (useful for heatmaps of correlations):

>>> mcc.set_index(["T", "F"]).MCC.unstack().rename_axis(None, axis=0)
F        a        b        c
a  1.00000  1.00000  0.55168
b  1.00000  1.00000  0.55168
c  0.55168  0.55168  1.00000
pyranges.statistics.rowbased_spearman(x, y)

Fast row-based Spearman’s correlation.

Parameters:
  • x (matrix-like) – 2D numerical matrix. Same size as y.

  • y (matrix-like) – 2D numerical matrix. Same size as x.

Returns:

Array with same length as input, where values are P-values.

Return type:

numpy.array

See also

pyranges.statistics.rowbased_pearson

fast row-based Pearson’s correlation.

pr.stats.fdr

correct for multiple testing

Examples

>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
>>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

Perform Spearman’s correlation pairwise on each row in 10x10 matrixes:

>>> pr.stats.rowbased_spearman(x, y)
array([-0.5,  0.5, -1. ])
pyranges.statistics.rowbased_pearson(x, y)

Fast row-based Pearson’s correlation.

Parameters:
  • x (matrix-like) – 2D numerical matrix. Same size as y.

  • y (matrix-like) – 2D numerical matrix. Same size as x.

Returns:

Array with same length as input, where values are P-values.

Return type:

numpy.array

See also

pyranges.statistics.rowbased_spearman

fast row-based Spearman’s correlation.

pr.stats.fdr

correct for multiple testing

Examples

>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
>>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

Perform Pearson’s correlation pairwise on each row in 10x10 matrixes:

>>> pr.stats.rowbased_pearson(x, y)
array([-0.09078413,  0.65465367, -1.        ])
pyranges.statistics.rowbased_rankdata(data)

Rank order of entries in each row.

Same as SciPy rankdata with method=mean.

Parameters:

data (matrix-like) – The data to find order of.

Returns:

DataFrame where values are order of data.

Return type:

Pandas.DataFrame

Examples

>>> x = np.random.randint(10, size=(3, 4))
>>> x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]])
>>> pr.stats.rowbased_rankdata(x)
     0    1    2    3
0  2.0  4.0  3.0  1.0
1  1.0  2.0  3.0  4.0
2  2.5  4.0  1.0  2.5
pyranges.statistics.simes(df, groupby, pcol, keep_position=False)

Apply Simes method for giving dependent events a p-value.

Parameters:
  • df (pandas.DataFrame) – Data to analyse with Simes.

  • groupby (str or list of str) – Features equal in these columns will be merged with Simes.

  • pcol (str) – Name of column with p-values.

  • keep_position (bool, default False) – Keep columns “Chromosome”, “Start”, “End” and “Strand” if they exist.

See also

pr.stats.fdr

correct for multiple testing

Examples

>>> s = '''Chromosome Start End Strand Gene PValue
... 1 10 20 + P53 0.0001
... 1 20 20 + P53 0.0002
... 1 30 20 + P53 0.0003
... 2 60 65 - FOX 0.05
... 2 70 75 - FOX 0.0000001
... 2 80 90 - FOX 0.0000021'''
>>> gr = pr.from_string(s)
>>> gr
+--------------+-----------+-----------+--------------+------------+-------------+
|   Chromosome |     Start |       End | Strand       | Gene       |      PValue |
|   (category) |   (int64) |   (int64) | (category)   | (object)   |   (float64) |
|--------------+-----------+-----------+--------------+------------+-------------|
|            1 |        10 |        20 | +            | P53        |     0.0001  |
|            1 |        20 |        20 | +            | P53        |     0.0002  |
|            1 |        30 |        20 | +            | P53        |     0.0003  |
|            2 |        60 |        65 | -            | FOX        |     0.05    |
|            2 |        70 |        75 | -            | FOX        |     1e-07   |
|            2 |        80 |        90 | -            | FOX        |     2.1e-06 |
+--------------+-----------+-----------+--------------+------------+-------------+
Stranded PyRanges object has 6 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> simes = pr.stats.simes(gr.df, "Gene", "PValue")
>>> simes
  Gene         Simes
0  FOX  3.000000e-07
1  P53  3.000000e-04
>>> gr.apply(lambda df:
... pr.stats.simes(df, "Gene", "PValue", keep_position=True))
+--------------+-----------+-----------+-------------+--------------+------------+
|   Chromosome |     Start |       End |       Simes | Strand       | Gene       |
|   (category) |   (int64) |   (int64) |   (float64) | (category)   | (object)   |
|--------------+-----------+-----------+-------------+--------------+------------|
|            1 |        10 |        20 |      0.0001 | +            | P53        |
|            2 |        60 |        90 |      1e-07  | -            | FOX        |
+--------------+-----------+-----------+-------------+--------------+------------+
Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
class pyranges.statistics.StatisticsMethods(pr)

Namespace for statistical comparsion-operations.

Accessed with gr.stats.

pr
forbes(other, chromsizes, strandedness=None)

Compute Forbes coefficient.

Ratio which represents observed versus expected co-occurence.

Described in Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.

Parameters:
  • other (PyRanges) – Intervals to compare with.

  • chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to its length.

  • strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.

Returns:

Ratio of observed versus expected co-occurence.

Return type:

float

See also

pyranges.statistics.jaccard

compute the jaccard coefficient

Examples

>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.forbes(gr2, chromsizes=chromsizes)
1.7168314674978278
jaccard(other, **kwargs)

Compute Jaccards coefficient.

Ratio of the intersection and union of two sets.

Parameters:
  • other (PyRanges) – Intervals to compare with.

  • chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to its length.

  • strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.

Returns:

Ratio of the intersection and union of two sets.

Return type:

float

See also

pyranges.statistics.forbes

compute the forbes coefficient

Examples

>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.jaccard(gr2, chromsizes=chromsizes)
6.657941988519211e-05
relative_distance(other, **kwargs)

Compute spatial correllation between two sets.

Metric which describes relative distance between each interval in one set and two closest intervals in another.

Parameters:
  • other (PyRanges) – Intervals to compare with.

  • chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to its length.

  • strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.

Returns:

DataFrame containing the frequency of each relative distance.

Return type:

pandas.DataFrame

See also

pyranges.statistics.jaccard

compute the jaccard coefficient

pyranges.statistics.forbes

compute the forbes coefficient

Examples

>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.relative_distance(gr2)
    reldist  count  total  fraction
0      0.00    264   9956  0.026517
1      0.01    226   9956  0.022700
2      0.02    206   9956  0.020691
3      0.03    235   9956  0.023604
4      0.04    194   9956  0.019486
5      0.05    241   9956  0.024207
6      0.06    201   9956  0.020189
7      0.07    191   9956  0.019184
8      0.08    192   9956  0.019285
9      0.09    191   9956  0.019184
10     0.10    186   9956  0.018682
11     0.11    203   9956  0.020390
12     0.12    218   9956  0.021896
13     0.13    209   9956  0.020992
14     0.14    201   9956  0.020189
15     0.15    178   9956  0.017879
16     0.16    202   9956  0.020289
17     0.17    197   9956  0.019787
18     0.18    208   9956  0.020892
19     0.19    202   9956  0.020289
20     0.20    191   9956  0.019184
21     0.21    188   9956  0.018883
22     0.22    213   9956  0.021394
23     0.23    192   9956  0.019285
24     0.24    199   9956  0.019988
25     0.25    181   9956  0.018180
26     0.26    172   9956  0.017276
27     0.27    191   9956  0.019184
28     0.28    190   9956  0.019084
29     0.29    192   9956  0.019285
30     0.30    201   9956  0.020189
31     0.31    212   9956  0.021294
32     0.32    213   9956  0.021394
33     0.33    177   9956  0.017778
34     0.34    197   9956  0.019787
35     0.35    163   9956  0.016372
36     0.36    191   9956  0.019184
37     0.37    198   9956  0.019888
38     0.38    160   9956  0.016071
39     0.39    188   9956  0.018883
40     0.40    200   9956  0.020088
41     0.41    188   9956  0.018883
42     0.42    230   9956  0.023102
43     0.43    197   9956  0.019787
44     0.44    224   9956  0.022499
45     0.45    184   9956  0.018481
46     0.46    198   9956  0.019888
47     0.47    187   9956  0.018783
48     0.48    200   9956  0.020088
49     0.49    194   9956  0.019486