:py:mod:`pyranges.statistics`
=============================

.. py:module:: pyranges.statistics

.. autoapi-nested-parse::

   Statistics useful for genomics.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   pyranges.statistics.StatisticsMethods


Functions
~~~~~~~~~

.. autoapisummary::

   pyranges.statistics.fdr
   pyranges.statistics.fisher_exact
   pyranges.statistics.mcc
   pyranges.statistics.rowbased_spearman
   pyranges.statistics.rowbased_pearson
   pyranges.statistics.rowbased_rankdata
   pyranges.statistics.simes


.. py:function:: fdr(p_vals)

   Adjust p-values with Benjamini-Hochberg.

   :param data:
   :type data: array-like

   :returns: DataFrame where values are order of data.
   :rtype: Pandas.DataFrame

   .. rubric:: Examples

   >>> d = {'Chromosome': ['chr3', 'chr6', 'chr13'], 'Start': [146419383, 39800100, 24537618], 'End': [146419483, 39800200, 24537718], 'Strand': ['-', '+', '-'], 'PValue': [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]}
   >>> gr = pr.from_dict(d)
   >>> gr
   +--------------+-----------+-----------+--------------+-------------+
   | Chromosome   |     Start |       End | Strand       |      PValue |
   | (category)   |   (int64) |   (int64) | (category)   |   (float64) |
   |--------------+-----------+-----------+--------------+-------------|
   | chr3         | 146419383 | 146419483 | -            |  0.00395914 |
   | chr6         |  39800100 |  39800200 | +            |  0.00376005 |
   | chr13        |  24537618 |  24537718 | -            |  0.00750612 |
   +--------------+-----------+-----------+--------------+-------------+
   Stranded PyRanges object has 3 rows and 5 columns from 3 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.

   >>> gr.FDR = pr.stats.fdr(gr.PValue)
   >>> gr.print(formatting={"PValue": "{:.4f}", "FDR": "{:.4}"})
   +--------------+-----------+-----------+--------------+-------------+-------------+
   | Chromosome   |     Start |       End | Strand       |      PValue |         FDR |
   | (category)   |   (int64) |   (int64) | (category)   |   (float64) |   (float64) |
   |--------------+-----------+-----------+--------------+-------------+-------------|
   | chr3         | 146419383 | 146419483 | -            |      0.004  |    0.005939 |
   | chr6         |  39800100 |  39800200 | +            |      0.0038 |    0.01128  |
   | chr13        |  24537618 |  24537718 | -            |      0.0075 |    0.007506 |
   +--------------+-----------+-----------+--------------+-------------+-------------+
   Stranded PyRanges object has 3 rows and 6 columns from 3 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.


.. py:function:: fisher_exact(tp, fp, fn, tn, pseudocount=0)

   Fisher's exact for contingency tables.

   Computes the hypotheses two-sided, less and greater at the same time.

   The odds-ratio is

   :param tp: Top left square of contingency table (true positives).
   :type tp: array-like of int
   :param fp: Top right square of contingency table (false positives).
   :type fp: array-like of int
   :param fn: Bottom left square of contingency table (false negatives).
   :type fn: array-like of int
   :param tn: Bottom right square of contingency table (true negatives).
   :type tn: array-like of int
   :param pseudocount: Values > 0 allow Odds Ratio to always be a finite number.
   :type pseudocount: float, default 0

   .. rubric:: Notes

   The odds-ratio is computed thusly:

   ``((tp + pseudocount) / (fp + pseudocount)) / ((fn + pseudocount) / (tn + pseudocount))``

   :returns: DataFrame with columns OR and P, PLeft and PRight.
   :rtype: pandas.DataFrame

   .. seealso::

      :obj:`pr.stats.fdr`
          correct for multiple testing

   .. rubric:: Examples

   >>> d = {"TP": [12, 0], "FP": [5, 12], "TN": [29, 10], "FN": [2, 2]}
   >>> df = pd.DataFrame(d)
   >>> df
      TP  FP  TN  FN
   0  12   5  29   2
   1   0  12  10   2

   >>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN)
            OR         P     PLeft    PRight
   0  0.165517  0.080269  0.044555  0.994525
   1  0.000000  0.000067  0.000034  1.000000


.. py:function:: mcc(grs, genome=None, labels=None, strand=False, verbose=False)

   Compute Matthew's correlation coefficient for PyRanges overlaps.

   :param grs: PyRanges to compare.
   :type grs: list of PyRanges
   :param genome: Should contain chromosome sizes. By default, end position of the
                  rightmost intervals are used as proxies for the chromosome size, but
                  it is recommended to use a genome.
   :type genome: DataFrame or dict, default None
   :param labels: Names to give the PyRanges in the output.
   :type labels: list of str, default None
   :param strand: Whether to compute correlations per strand.
   :type strand: bool, default False
   :param verbose: Warn if some chromosomes are in the genome, but not in the PyRanges.
   :type verbose: bool, default False

   .. rubric:: Examples

   >>> grs = [pr.data.aorta(), pr.data.aorta(), pr.data.aorta2()]
   >>> mcc = pr.stats.mcc(grs, labels="abc", genome={"chr1": 2100000})
   >>> mcc
      T  F   TP   FP       TN   FN      MCC
   0  a  a  728    0  2099272    0  1.00000
   1  a  b  728    0  2099272    0  1.00000
   3  a  c  457  485  2098787  271  0.55168
   2  b  a  728    0  2099272    0  1.00000
   5  b  b  728    0  2099272    0  1.00000
   6  b  c  457  485  2098787  271  0.55168
   4  c  a  457  271  2098787  485  0.55168
   7  c  b  457  271  2098787  485  0.55168
   8  c  c  942    0  2099058    0  1.00000

   To create a symmetric matrix (useful for heatmaps of correlations):

   >>> mcc.set_index(["T", "F"]).MCC.unstack().rename_axis(None, axis=0)
   F        a        b        c
   a  1.00000  1.00000  0.55168
   b  1.00000  1.00000  0.55168
   c  0.55168  0.55168  1.00000


.. py:function:: rowbased_spearman(x, y)

   Fast row-based Spearman's correlation.

   :param x: 2D numerical matrix. Same size as y.
   :type x: matrix-like
   :param y: 2D numerical matrix. Same size as x.
   :type y: matrix-like

   :returns: Array with same length as input, where values are P-values.
   :rtype: numpy.array

   .. seealso::

      :obj:`pyranges.statistics.rowbased_pearson`
          fast row-based Pearson's correlation.

      :obj:`pr.stats.fdr`
          correct for multiple testing

   .. rubric:: Examples

   >>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
   >>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

   Perform Spearman's correlation pairwise on each row in 10x10 matrixes:

   >>> pr.stats.rowbased_spearman(x, y)
   array([-0.5,  0.5, -1. ])


.. py:function:: rowbased_pearson(x, y)

   Fast row-based Pearson's correlation.

   :param x: 2D numerical matrix. Same size as y.
   :type x: matrix-like
   :param y: 2D numerical matrix. Same size as x.
   :type y: matrix-like

   :returns: Array with same length as input, where values are P-values.
   :rtype: numpy.array

   .. seealso::

      :obj:`pyranges.statistics.rowbased_spearman`
          fast row-based Spearman's correlation.

      :obj:`pr.stats.fdr`
          correct for multiple testing

   .. rubric:: Examples

   >>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]])
   >>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])

   Perform Pearson's correlation pairwise on each row in 10x10 matrixes:

   >>> pr.stats.rowbased_pearson(x, y)
   array([-0.09078413,  0.65465367, -1.        ])


.. py:function:: rowbased_rankdata(data)

   Rank order of entries in each row.

   Same as SciPy rankdata with method=mean.

   :param data: The data to find order of.
   :type data: matrix-like

   :returns: DataFrame where values are order of data.
   :rtype: Pandas.DataFrame

   .. rubric:: Examples

   >>> x = np.random.randint(10, size=(3, 4))
   >>> x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]])
   >>> pr.stats.rowbased_rankdata(x)
        0    1    2    3
   0  2.0  4.0  3.0  1.0
   1  1.0  2.0  3.0  4.0
   2  2.5  4.0  1.0  2.5


.. py:function:: simes(df, groupby, pcol, keep_position=False)

   Apply Simes method for giving dependent events a p-value.

   :param df: Data to analyse with Simes.
   :type df: pandas.DataFrame
   :param groupby: Features equal in these columns will be merged with Simes.
   :type groupby: str or list of str
   :param pcol: Name of column with p-values.
   :type pcol: str
   :param keep_position: Keep columns "Chromosome", "Start", "End" and "Strand" if they exist.
   :type keep_position: bool, default False

   .. seealso::

      :obj:`pr.stats.fdr`
          correct for multiple testing

   .. rubric:: Examples

   >>> s = '''Chromosome Start End Strand Gene PValue
   ... 1 10 20 + P53 0.0001
   ... 1 20 20 + P53 0.0002
   ... 1 30 20 + P53 0.0003
   ... 2 60 65 - FOX 0.05
   ... 2 70 75 - FOX 0.0000001
   ... 2 80 90 - FOX 0.0000021'''

   >>> gr = pr.from_string(s)
   >>> gr
   +--------------+-----------+-----------+--------------+------------+-------------+
   |   Chromosome |     Start |       End | Strand       | Gene       |      PValue |
   |   (category) |   (int64) |   (int64) | (category)   | (object)   |   (float64) |
   |--------------+-----------+-----------+--------------+------------+-------------|
   |            1 |        10 |        20 | +            | P53        |     0.0001  |
   |            1 |        20 |        20 | +            | P53        |     0.0002  |
   |            1 |        30 |        20 | +            | P53        |     0.0003  |
   |            2 |        60 |        65 | -            | FOX        |     0.05    |
   |            2 |        70 |        75 | -            | FOX        |     1e-07   |
   |            2 |        80 |        90 | -            | FOX        |     2.1e-06 |
   +--------------+-----------+-----------+--------------+------------+-------------+
   Stranded PyRanges object has 6 rows and 6 columns from 2 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.

   >>> simes = pr.stats.simes(gr.df, "Gene", "PValue")
   >>> simes
     Gene         Simes
   0  FOX  3.000000e-07
   1  P53  3.000000e-04

   >>> gr.apply(lambda df:
   ... pr.stats.simes(df, "Gene", "PValue", keep_position=True))
   +--------------+-----------+-----------+-------------+--------------+------------+
   |   Chromosome |     Start |       End |       Simes | Strand       | Gene       |
   |   (category) |   (int64) |   (int64) |   (float64) | (category)   | (object)   |
   |--------------+-----------+-----------+-------------+--------------+------------|
   |            1 |        10 |        20 |      0.0001 | +            | P53        |
   |            2 |        60 |        90 |      1e-07  | -            | FOX        |
   +--------------+-----------+-----------+-------------+--------------+------------+
   Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.


.. py:class:: StatisticsMethods(pr)


   Namespace for statistical comparsion-operations.

   Accessed with gr.stats.

   .. py:attribute:: pr

      
   .. py:method:: forbes(other, chromsizes, strandedness=None)

      Compute Forbes coefficient.

      Ratio which represents observed versus expected co-occurence.

      Described in ``Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.``

      :param other: Intervals to compare with.
      :type other: PyRanges
      :param chromsizes: Integer representing genome length or mapping from chromosomes
                         to its length.
      :type chromsizes: int, dict, DataFrame or PyRanges
      :param strandedness: Whether to compute without regards to strand or on same or opposite.
      :type strandedness: {None, "same", "opposite", False}, default None, i.e. "auto"

      :returns: Ratio of observed versus expected co-occurence.
      :rtype: float

      .. seealso::

         :obj:`pyranges.statistics.jaccard`
             compute the jaccard coefficient

      .. rubric:: Examples

      >>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
      >>> chromsizes = pr.data.chromsizes()
      >>> gr.stats.forbes(gr2, chromsizes=chromsizes)
      1.7168314674978278


   .. py:method:: jaccard(other, **kwargs)

      Compute Jaccards coefficient.

      Ratio of the intersection and union of two sets.

      :param other: Intervals to compare with.
      :type other: PyRanges
      :param chromsizes: Integer representing genome length or mapping from chromosomes
                         to its length.
      :type chromsizes: int, dict, DataFrame or PyRanges
      :param strandedness: Whether to compute without regards to strand or on same or opposite.
      :type strandedness: {None, "same", "opposite", False}, default None, i.e. "auto"

      :returns: Ratio of the intersection and union of two sets.
      :rtype: float

      .. seealso::

         :obj:`pyranges.statistics.forbes`
             compute the forbes coefficient

      .. rubric:: Examples

      >>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
      >>> chromsizes = pr.data.chromsizes()
      >>> gr.stats.jaccard(gr2, chromsizes=chromsizes)
      6.657941988519211e-05


   .. py:method:: relative_distance(other, **kwargs)

      Compute spatial correllation between two sets.

      Metric which describes relative distance between each interval in one
      set and two closest intervals in another.

      :param other: Intervals to compare with.
      :type other: PyRanges
      :param chromsizes: Integer representing genome length or mapping from chromosomes
                         to its length.
      :type chromsizes: int, dict, DataFrame or PyRanges
      :param strandedness: Whether to compute without regards to strand or on same or opposite.
      :type strandedness: {None, "same", "opposite", False}, default None, i.e. "auto"

      :returns: DataFrame containing the frequency of each relative distance.
      :rtype: pandas.DataFrame

      .. seealso::

         :obj:`pyranges.statistics.jaccard`
             compute the jaccard coefficient

         :obj:`pyranges.statistics.forbes`
             compute the forbes coefficient

      .. rubric:: Examples

      >>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background()
      >>> chromsizes = pr.data.chromsizes()
      >>> gr.stats.relative_distance(gr2)
          reldist  count  total  fraction
      0      0.00    264   9956  0.026517
      1      0.01    226   9956  0.022700
      2      0.02    206   9956  0.020691
      3      0.03    235   9956  0.023604
      4      0.04    194   9956  0.019486
      5      0.05    241   9956  0.024207
      6      0.06    201   9956  0.020189
      7      0.07    191   9956  0.019184
      8      0.08    192   9956  0.019285
      9      0.09    191   9956  0.019184
      10     0.10    186   9956  0.018682
      11     0.11    203   9956  0.020390
      12     0.12    218   9956  0.021896
      13     0.13    209   9956  0.020992
      14     0.14    201   9956  0.020189
      15     0.15    178   9956  0.017879
      16     0.16    202   9956  0.020289
      17     0.17    197   9956  0.019787
      18     0.18    208   9956  0.020892
      19     0.19    202   9956  0.020289
      20     0.20    191   9956  0.019184
      21     0.21    188   9956  0.018883
      22     0.22    213   9956  0.021394
      23     0.23    192   9956  0.019285
      24     0.24    199   9956  0.019988
      25     0.25    181   9956  0.018180
      26     0.26    172   9956  0.017276
      27     0.27    191   9956  0.019184
      28     0.28    190   9956  0.019084
      29     0.29    192   9956  0.019285
      30     0.30    201   9956  0.020189
      31     0.31    212   9956  0.021294
      32     0.32    213   9956  0.021394
      33     0.33    177   9956  0.017778
      34     0.34    197   9956  0.019787
      35     0.35    163   9956  0.016372
      36     0.36    191   9956  0.019184
      37     0.37    198   9956  0.019888
      38     0.38    160   9956  0.016071
      39     0.39    188   9956  0.018883
      40     0.40    200   9956  0.020088
      41     0.41    188   9956  0.018883
      42     0.42    230   9956  0.023102
      43     0.43    197   9956  0.019787
      44     0.44    224   9956  0.022499
      45     0.45    184   9956  0.018481
      46     0.46    198   9956  0.019888
      47     0.47    187   9956  0.018783
      48     0.48    200   9956  0.020088
      49     0.49    194   9956  0.019486