`pyranges.statistics`

Statistics useful for genomics.

Module Contents

Classes

StatisticsMethods

Namespace for statistical comparison operations.

Functions

`fdr`(p_vals)	Adjust p-values with Benjamini-Hochberg.
`fisher_exact`(tp, fp, fn, tn[, pseudocount])	Fisher's exact for contingency tables.
`mcc`(grs[, genome, labels, strand, verbose])	Compute Matthews correlation coefficient for PyRanges overlaps.
`rowbased_spearman`(x, y)	Fast row-based Spearman's correlation.
`rowbased_pearson`(x, y)	Fast row-based Pearson's correlation.
`rowbased_rankdata`(data)	Rank order of entries in each row.
`simes`(df, groupby, pcol[, keep_position])	Apply Simes method for giving dependent events a p-value.

pyranges.statistics.fdr(p_vals)

Adjust p-values with Benjamini-Hochberg.

Parameters:

p_vals (array-like) – P-values to adjust.

Returns:

Adjusted p-values, in the same order as p_vals.

Return type:

array-like

Examples

>>> d = {'Chromosome': ['chr3', 'chr6', 'chr13'], 'Start': [146419383, 39800100, 24537618], 'End': [146419483, 39800200, 24537718], 'Strand': ['-', '+', '-'], 'PValue': [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]}
>>> gr = pr.from_dict(d)
>>> gr
+--------------+-----------+-----------+--------------+-------------+
| Chromosome   |     Start |       End | Strand       |      PValue |
| (category)   |   (int64) |   (int64) | (category)   |   (float64) |
|--------------+-----------+-----------+--------------+-------------|
| chr3         | 146419383 | 146419483 | -            |  0.00395914 |
| chr6         |  39800100 |  39800200 | +            |  0.00376005 |
| chr13        |  24537618 |  24537718 | -            |  0.00750612 |
+--------------+-----------+-----------+--------------+-------------+
Stranded PyRanges object has 3 rows and 5 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

>>> gr.FDR = pr.stats.fdr(gr.PValue)
>>> gr.print(formatting={"PValue": "{:.4f}", "FDR": "{:.4}"})
+--------------+-----------+-----------+--------------+-------------+-------------+
| Chromosome   |     Start |       End | Strand       |      PValue |         FDR |
| (category)   |   (int64) |   (int64) | (category)   |   (float64) |   (float64) |
|--------------+-----------+-----------+--------------+-------------+-------------|
| chr3         | 146419383 | 146419483 | -            |      0.004  |    0.005939 |
| chr6         |  39800100 |  39800200 | +            |      0.0038 |    0.01128  |
| chr13        |  24537618 |  24537718 | -            |      0.0075 |    0.007506 |
+--------------+-----------+-----------+--------------+-------------+-------------+
Stranded PyRanges object has 3 rows and 6 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
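
The FDR column above is consistent with multiplying each p-value by the number of tests and dividing by its rank, capped at 1. The following pure-Python sketch illustrates that arithmetic. `bh_adjust_sketch` is a hypothetical helper, not part of pyranges, and its ordinal handling of ties is an assumption. It does not enforce monotonicity of the adjusted values, which some Benjamini-Hochberg implementations do.

```python
def bh_adjust_sketch(p_vals):
    """Scale each p-value by n / rank and cap the result at 1 (illustrative only)."""
    n = len(p_vals)
    # Rank 1 goes to the smallest p-value; ties get ordinal ranks (assumption).
    order = sorted(range(n), key=lambda i: p_vals[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return [min(1.0, p * n / r) for p, r in zip(p_vals, ranks)]


# P-values from the example above, in the same row order.
p_values = [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]
adjusted = bh_adjust_sketch(p_values)
print(adjusted)
```

The smallest p-value (chr6) has rank 1 and is therefore multiplied by the full number of tests, which is why its adjusted value is the largest of the three.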

pyranges.statistics.fisher_exact(tp, fp, fn, tn, pseudocount=0)

Fisher’s exact for contingency tables.

Computes the two-sided, less and greater alternatives at the same time.

Parameters:

tp (array-like of int) – Top left square of contingency table (true positives).
fp (array-like of int) – Top right square of contingency table (false positives).
fn (array-like of int) – Bottom left square of contingency table (false negatives).
tn (array-like of int) – Bottom right square of contingency table (true negatives).
pseudocount (float, default 0) – Values > 0 allow Odds Ratio to always be a finite number.

Notes

The odds-ratio is computed as follows:

((tp + pseudocount) / (fp + pseudocount)) / ((fn + pseudocount) / (tn + pseudocount))

Returns:

DataFrame with columns OR, P, PLeft and PRight.

Return type:

pandas.DataFrame

See also

pr.stats.fdr: correct for multiple testing

Examples

>>> d = {"TP": [12, 0], "FP": [5, 12], "TN": [29, 10], "FN": [2, 2]}
>>> df = pd.DataFrame(d)
>>> df
   TP  FP  TN  FN
0  12   5  29   2
1   0  12  10   2

>>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN)
         OR         P     PLeft    PRight
0  0.165517  0.080269  0.044555  0.994525
1  0.000000  0.000067  0.000034  1.000000
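
The odds-ratio formula from the Notes can be checked by hand. The sketch below is a plain-Python transcription of that formula for scalar inputs; `odds_ratio_sketch` is a hypothetical name and not a pyranges function.

```python
def odds_ratio_sketch(tp, fp, fn, tn, pseudocount=0):
    """Odds ratio as written in the Notes; scalar inputs only (illustrative)."""
    # With pseudocount=0, a zero in fp, fn or tn raises ZeroDivisionError in
    # plain Python; a pseudocount > 0 keeps every denominator positive.
    top = (tp + pseudocount) / (fp + pseudocount)
    bottom = (fn + pseudocount) / (tn + pseudocount)
    return top / bottom


# First row of the example above, using the values in the order they are passed
# positionally (12, 5, 29, 2).
print(odds_ratio_sketch(12, 5, 29, 2))
# Second row: a zero count gives an odds ratio of 0 unless a pseudocount is added.
print(odds_ratio_sketch(0, 12, 10, 2))
print(odds_ratio_sketch(0, 12, 10, 2, pseudocount=1))
```

The first call reproduces the OR value of row 0 in the output above.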

pyranges.statistics.mcc(grs, genome=None, labels=None, strand=False, verbose=False)

Compute Matthews correlation coefficient for PyRanges overlaps.

Parameters:

grs (list of PyRanges) – PyRanges to compare.
genome (DataFrame or dict, default None) – Should contain chromosome sizes. By default, the end position of the rightmost interval on each chromosome is used as a proxy for the chromosome size, but supplying a genome is recommended.
labels (list of str, default None) – Names to give the PyRanges in the output.
strand (bool, default False) – Whether to compute correlations per strand.
verbose (bool, default False) – Warn if some chromosomes are in the genome, but not in the PyRanges.

Examples

>>> grs = [pr.data.aorta(), pr.data.aorta(), pr.data.aorta2()]
>>> mcc = pr.stats.mcc(grs, labels="abc", genome={"chr1": 2100000})
>>> mcc
   T  F   TP   FP       TN   FN      MCC
0  a  a  728    0  2099272    0  1.00000
1  a  b  728    0  2099272    0  1.00000
3  a  c  457  485  2098787  271  0.55168
2  b  a  728    0  2099272    0  1.00000
5  b  b  728    0  2099272    0  1.00000
6  b  c  457  485  2098787  271  0.55168
4  c  a  457  271  2098787  485  0.55168
7  c  b  457  271  2098787  485  0.55168
8  c  c  942    0  2099058    0  1.00000

To create a symmetric matrix (useful for heatmaps of correlations):

>>> mcc.set_index(["T", "F"]).MCC.unstack().rename_axis(None, axis=0)
F        a        b        c
a  1.00000  1.00000  0.55168
b  1.00000  1.00000  0.55168
c  0.55168  0.55168  1.00000
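
Each MCC value in the first table can be recomputed from the TP, FP, TN and FN counts in the same row with the standard Matthews correlation coefficient formula. A minimal sketch, where `mcc_from_counts` is a hypothetical helper rather than a pyranges function, and returning 0.0 for a zero denominator is an assumed convention:

```python
import math


def mcc_from_counts(tp, fp, tn, fn):
    """Standard MCC formula applied to one row of counts (illustrative)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Row "a c" and row "a a" from the table above.
print(round(mcc_from_counts(457, 485, 2098787, 271), 5))
print(mcc_from_counts(728, 0, 2099272, 0))
```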

pyranges.statistics.rowbased_spearman(x, y)

Fast row-based Spearman’s correlation.

Parameters:

x (matrix-like) – 2D numerical matrix. Same size as y.
y (matrix-like) – 2D numerical matrix. Same size as x.

Returns:

Array with one entry per row of the input, where values are Spearman correlation coefficients.

Return type:

numpy.array

See also

pyranges.statistics.rowbased_pearson: fast row-based Pearson’s correlation.
pr.stats.fdr: correct for multiple testing

Examples

>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
>>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

Perform Spearman’s correlation pairwise on each row in 3x3 matrices:

>>> pr.stats.rowbased_spearman(x, y)
array([-0.5,  0.5, -1. ])
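
To see where these numbers come from, the sketch below applies the textbook Spearman formula for data without ties, 1 - 6 * sum(d^2) / (n * (n^2 - 1)), to each pair of rows. It is plain Python and only valid when a row contains no repeated values; `spearman_rows_sketch` and `ordinal_ranks` are hypothetical names, and this is not how pyranges implements the function.

```python
def ordinal_ranks(row):
    """Rank 1 for the smallest value; assumes the row has no ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0] * len(row)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def spearman_rows_sketch(x, y):
    """Row-by-row Spearman correlation via the no-ties rank-difference formula."""
    result = []
    for row_x, row_y in zip(x, y):
        n = len(row_x)
        d_squared = sum(
            (a - b) ** 2 for a, b in zip(ordinal_ranks(row_x), ordinal_ranks(row_y))
        )
        result.append(1 - 6 * d_squared / (n * (n * n - 1)))
    return result


x = [[7, 2, 9], [3, 6, 0], [0, 6, 3]]
y = [[5, 3, 2], [9, 6, 0], [7, 3, 5]]
print(spearman_rows_sketch(x, y))  # [-0.5, 0.5, -1.0]
```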

pyranges.statistics.rowbased_pearson(x, y)

Fast row-based Pearson’s correlation.

Parameters:

x (matrix-like) – 2D numerical matrix. Same size as y.
y (matrix-like) – 2D numerical matrix. Same size as x.

Returns:

Array with one entry per row of the input, where values are Pearson correlation coefficients.

Return type:

numpy.array

See also

pyranges.statistics.rowbased_spearman: fast row-based Spearman’s correlation.
pr.stats.fdr: correct for multiple testing

Examples

>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
>>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

Perform Pearson’s correlation pairwise on each row in 3x3 matrices:

>>> pr.stats.rowbased_pearson(x, y)
array([-0.09078413,  0.65465367, -1.        ])
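
For comparison, a straightforward (not fast) pure-Python version of the same row-wise computation is sketched below. It uses the usual definition of Pearson's r as the covariance divided by the product of the standard deviations; `pearson_rows_sketch` is a hypothetical name, and rows with zero variance are not handled.

```python
import math


def pearson_rows_sketch(x, y):
    """Pearson's r for each pair of rows; assumes no row is constant."""
    result = []
    for row_x, row_y in zip(x, y):
        mean_x = sum(row_x) / len(row_x)
        mean_y = sum(row_y) / len(row_y)
        # Sums of centred products; the 1/n factors cancel in the ratio.
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(row_x, row_y))
        spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in row_x))
        spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in row_y))
        result.append(cov / (spread_x * spread_y))
    return result


x = [[7, 2, 9], [3, 6, 0], [0, 6, 3]]
y = [[5, 3, 2], [9, 6, 0], [7, 3, 5]]
print(pearson_rows_sketch(x, y))
```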

pyranges.statistics.rowbased_rankdata(data)

Rank order of entries in each row.

Same as SciPy rankdata with method="average", applied to each row.

Parameters:

data (matrix-like) – The data to find order of.

Returns:

DataFrame where values are order of data.

Return type:

pandas.DataFrame

Examples

>>> x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]])
>>> pr.stats.rowbased_rankdata(x)
     0    1    2    3
0  2.0  4.0  3.0  1.0
1  1.0  2.0  3.0  4.0
2  2.5  4.0  1.0  2.5
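
The tie handling in the last row (two values of 5 sharing rank 2.5) follows the average-rank rule: tied values receive the mean of the ranks they would jointly occupy. A minimal pure-Python sketch of that rule for a single row, with `average_ranks` as a hypothetical helper name:

```python
def average_ranks(row):
    """1-based ranks where tied values share the mean of their positions."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values equal to row[order[i]].
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


for row in [[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]]:
    print(average_ranks(row))
```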

pyranges.statistics.simes(df, groupby, pcol, keep_position=False)

Apply Simes’ method to combine the p-values of dependent events into one p-value per group.

Parameters:

df (pandas.DataFrame) – Data to analyse with Simes.
groupby (str or list of str) – Features equal in these columns will be merged with Simes.
pcol (str) – Name of column with p-values.
keep_position (bool, default False) – Keep columns “Chromosome”, “Start”, “End” and “Strand” if they exist.

See also

pr.stats.fdr: correct for multiple testing

Examples

>>> s = '''Chromosome Start End Strand Gene PValue
... 1 10 20 + P53 0.0001
... 1 20 20 + P53 0.0002
... 1 30 20 + P53 0.0003
... 2 60 65 - FOX 0.05
... 2 70 75 - FOX 0.0000001
... 2 80 90 - FOX 0.0000021'''

>>> gr = pr.from_string(s)
>>> gr
+--------------+-----------+-----------+--------------+------------+-------------+
|   Chromosome |     Start |       End | Strand       | Gene       |      PValue |
|   (category) |   (int64) |   (int64) | (category)   | (object)   |   (float64) |
|--------------+-----------+-----------+--------------+------------+-------------|
|            1 |        10 |        20 | +            | P53        |     0.0001  |
|            1 |        20 |        20 | +            | P53        |     0.0002  |
|            1 |        30 |        20 | +            | P53        |     0.0003  |
|            2 |        60 |        65 | -            | FOX        |     0.05    |
|            2 |        70 |        75 | -            | FOX        |     1e-07   |
|            2 |        80 |        90 | -            | FOX        |     2.1e-06 |
+--------------+-----------+-----------+--------------+------------+-------------+
Stranded PyRanges object has 6 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

>>> simes = pr.stats.simes(gr.df, "Gene", "PValue")
>>> simes
  Gene         Simes
0  FOX  3.000000e-07
1  P53  3.000000e-04

>>> gr.apply(lambda df:
... pr.stats.simes(df, "Gene", "PValue", keep_position=True))
+--------------+-----------+-----------+-------------+--------------+------------+
|   Chromosome |     Start |       End |       Simes | Strand       | Gene       |
|   (category) |   (int64) |   (int64) |   (float64) | (category)   | (object)   |
|--------------+-----------+-----------+-------------+--------------+------------|
|            1 |        10 |        20 |      0.0001 | +            | P53        |
|            2 |        60 |        90 |      1e-07  | -            | FOX        |
+--------------+-----------+-----------+-------------+--------------+------------+
Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
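
The grouped result of the first call (3e-07 for FOX, 3e-04 for P53) matches the usual Simes statistic: sort the n p-values in a group, scale the i-th smallest by n / i, and take the minimum. The sketch below shows that calculation on plain dictionaries; `simes_sketch` is a hypothetical helper and makes no claim about how pyranges computes it internally.

```python
from collections import defaultdict


def simes_sketch(rows, group_key, p_key):
    """Return {group: min_i(p_(i) * n / i)} over the sorted p-values of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[p_key])
    combined = {}
    for name, p_vals in groups.items():
        n = len(p_vals)
        combined[name] = min(
            p * n / rank for rank, p in enumerate(sorted(p_vals), start=1)
        )
    return combined


rows = [
    {"Gene": "P53", "PValue": 0.0001},
    {"Gene": "P53", "PValue": 0.0002},
    {"Gene": "P53", "PValue": 0.0003},
    {"Gene": "FOX", "PValue": 0.05},
    {"Gene": "FOX", "PValue": 0.0000001},
    {"Gene": "FOX", "PValue": 0.0000021},
]
print(simes_sketch(rows, "Gene", "PValue"))
```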

class pyranges.statistics.StatisticsMethods(pr)

Namespace for statistical comparison operations.

Accessed with gr.stats.

pr

forbes(other, chromsizes, strandedness=None)

Compute Forbes coefficient.

Ratio which represents observed versus expected co-occurrence.

Described in Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.

Parameters:

other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.

Returns:

Ratio of observed versus expected co-occurrence.

Return type:

float

See also

pyranges.statistics.jaccard: compute the jaccard coefficient

Examples

>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.forbes(gr2, chromsizes=chromsizes)
1.7168314674978278
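
As an illustration of "observed versus expected", the sketch below assumes the coefficient is the observed overlap fraction divided by the overlap fraction expected if the two sets were placed independently, which simplifies to genome_length * overlap / (length_a * length_b). All names are hypothetical and the inputs are plain base-pair counts rather than PyRanges objects.

```python
def forbes_sketch(overlap_bp, length_a_bp, length_b_bp, genome_bp):
    """Observed / expected overlap under independence (illustrative only)."""
    observed = overlap_bp / genome_bp
    expected = (length_a_bp / genome_bp) * (length_b_bp / genome_bp)
    return observed / expected


# Two sets covering 100 and 200 bp of a 1000 bp genome.
print(forbes_sketch(20, 100, 200, 1000))  # overlap equals the expected 20 bp
print(forbes_sketch(50, 100, 200, 1000))  # more overlap than expected
```

Under this reading, values above 1 indicate more overlap than expected by chance and values below 1 indicate less.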

jaccard(other, **kwargs)

Compute Jaccard coefficient.

Ratio of the intersection and union of two sets.

Parameters:

other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.

Returns:

Ratio of the intersection and union of two sets.

Return type:

float

See also

pyranges.statistics.forbes: compute the forbes coefficient

Examples

>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.jaccard(gr2, chromsizes=chromsizes)
6.657941988519211e-05
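
The ratio of intersection to union can be illustrated on a single small sequence by treating each interval set as the set of positions it covers. This is only a sketch for toy inputs, assuming half-open (start, end) intervals; `covered_positions` and `jaccard_sketch` are hypothetical names.

```python
def covered_positions(intervals):
    """Set of integer positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def jaccard_sketch(a, b):
    """Covered positions shared by both sets divided by positions covered by either."""
    pos_a, pos_b = covered_positions(a), covered_positions(b)
    union = pos_a | pos_b
    return len(pos_a & pos_b) / len(union) if union else 0.0


a = [(0, 10), (20, 30)]
b = [(5, 25)]
# 10 shared positions out of 30 covered positions in total.
print(jaccard_sketch(a, b))
```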

relative_distance(other, **kwargs)

Compute spatial correlation between two sets.

Metric which describes the relative distance between each interval in one set and the two closest intervals in another.

Parameters:

other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to their lengths.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.

Returns:

DataFrame containing the frequency of each relative distance.

Return type:

pandas.DataFrame

See also

pyranges.statistics.jaccard: compute the jaccard coefficient
pyranges.statistics.forbes: compute the forbes coefficient

Examples

>>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
>>> chromsizes = pr.data.chromsizes()
>>> gr.stats.relative_distance(gr2)
reldist  count  total  fraction
   0.00    264   9956  0.026517
   0.01    226   9956  0.022700
   0.02    206   9956  0.020691
   0.03    235   9956  0.023604
   0.04    194   9956  0.019486
   0.05    241   9956  0.024207
   0.06    201   9956  0.020189
   0.07    191   9956  0.019184
   0.08    192   9956  0.019285
   0.09    191   9956  0.019184
   0.10    186   9956  0.018682
   0.11    203   9956  0.020390
   0.12    218   9956  0.021896
   0.13    209   9956  0.020992
   0.14    201   9956  0.020189
   0.15    178   9956  0.017879
   0.16    202   9956  0.020289
   0.17    197   9956  0.019787
   0.18    208   9956  0.020892
   0.19    202   9956  0.020289
   0.20    191   9956  0.019184
   0.21    188   9956  0.018883
   0.22    213   9956  0.021394
   0.23    192   9956  0.019285
   0.24    199   9956  0.019988
   0.25    181   9956  0.018180
   0.26    172   9956  0.017276
   0.27    191   9956  0.019184
   0.28    190   9956  0.019084
   0.29    192   9956  0.019285
   0.30    201   9956  0.020189
   0.31    212   9956  0.021294
   0.32    213   9956  0.021394
   0.33    177   9956  0.017778
   0.34    197   9956  0.019787
   0.35    163   9956  0.016372
   0.36    191   9956  0.019184
   0.37    198   9956  0.019888
   0.38    160   9956  0.016071
   0.39    188   9956  0.018883
   0.40    200   9956  0.020088
   0.41    188   9956  0.018883
   0.42    230   9956  0.023102
   0.43    197   9956  0.019787
   0.44    224   9956  0.022499
   0.45    184   9956  0.018481
   0.46    198   9956  0.019888
   0.47    187   9956  0.018783
   0.48    200   9956  0.020088
   0.49    194   9956  0.019486
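
One common way to define such a relative distance (an assumption here, not confirmed by this page) is: for a query position, take the nearest reference positions on its left and right, and divide the smaller of the two distances by the distance between those two references. The value then lies between 0 and 0.5, which agrees with the range of the reldist column above. The sketch works on single positions rather than intervals, and `relative_distances_sketch` is a hypothetical name.

```python
from bisect import bisect_right


def relative_distances_sketch(query_points, reference_points):
    """Relative distance of each query point to its two flanking reference points."""
    reference = sorted(reference_points)
    distances = []
    for q in query_points:
        i = bisect_right(reference, q)
        # Skip queries that do not have a reference point on both sides.
        if i == 0 or i == len(reference):
            continue
        left, right = reference[i - 1], reference[i]
        distances.append(min(q - left, right - q) / (right - left))
    return distances


# 150 lies outside the reference range and is skipped.
print(relative_distances_sketch([10, 50, 90, 150], [0, 100]))  # [0.1, 0.5, 0.1]
```

A table like the one above would then come from counting how many of these values fall into each bin of width 0.01.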
