:py:mod:`pyranges.statistics` ============================= .. py:module:: pyranges.statistics .. autoapi-nested-parse:: Statistics useful for genomics. Module Contents --------------- Classes ~~~~~~~ .. autoapisummary:: pyranges.statistics.StatisticsMethods Functions ~~~~~~~~~ .. autoapisummary:: pyranges.statistics.fdr pyranges.statistics.fisher_exact pyranges.statistics.mcc pyranges.statistics.rowbased_spearman pyranges.statistics.rowbased_pearson pyranges.statistics.rowbased_rankdata pyranges.statistics.simes .. py:function:: fdr(p_vals) Adjust p-values with Benjamini-Hochberg. :param data: :type data: array-like :returns: DataFrame where values are order of data. :rtype: Pandas.DataFrame .. rubric:: Examples >>> d = {'Chromosome': ['chr3', 'chr6', 'chr13'], 'Start': [146419383, 39800100, 24537618], 'End': [146419483, 39800200, 24537718], 'Strand': ['-', '+', '-'], 'PValue': [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]} >>> gr = pr.from_dict(d) >>> gr +--------------+-----------+-----------+--------------+-------------+ | Chromosome | Start | End | Strand | PValue | | (category) | (int64) | (int64) | (category) | (float64) | |--------------+-----------+-----------+--------------+-------------| | chr3 | 146419383 | 146419483 | - | 0.00395914 | | chr6 | 39800100 | 39800200 | + | 0.00376005 | | chr13 | 24537618 | 24537718 | - | 0.00750612 | +--------------+-----------+-----------+--------------+-------------+ Stranded PyRanges object has 3 rows and 5 columns from 3 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. >>> gr.FDR = pr.stats.fdr(gr.PValue) >>> gr.print(formatting={"PValue": "{:.4f}", "FDR": "{:.4}"}) +--------------+-----------+-----------+--------------+-------------+-------------+ | Chromosome | Start | End | Strand | PValue | FDR | | (category) | (int64) | (int64) | (category) | (float64) | (float64) | |--------------+-----------+-----------+--------------+-------------+-------------| | chr3 | 146419383 | 146419483 | - | 0.004 | 0.005939 | | chr6 | 39800100 | 39800200 | + | 0.0038 | 0.01128 | | chr13 | 24537618 | 24537718 | - | 0.0075 | 0.007506 | +--------------+-----------+-----------+--------------+-------------+-------------+ Stranded PyRanges object has 3 rows and 6 columns from 3 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. .. py:function:: fisher_exact(tp, fp, fn, tn, pseudocount=0) Fisher's exact for contingency tables. Computes the hypotheses two-sided, less and greater at the same time. The odds-ratio is :param tp: Top left square of contingency table (true positives). :type tp: array-like of int :param fp: Top right square of contingency table (false positives). :type fp: array-like of int :param fn: Bottom left square of contingency table (false negatives). :type fn: array-like of int :param tn: Bottom right square of contingency table (true negatives). :type tn: array-like of int :param pseudocount: Values > 0 allow Odds Ratio to always be a finite number. :type pseudocount: float, default 0 .. rubric:: Notes The odds-ratio is computed thusly: ``((tp + pseudocount) / (fp + pseudocount)) / ((fn + pseudocount) / (tn + pseudocount))`` :returns: DataFrame with columns OR and P, PLeft and PRight. :rtype: pandas.DataFrame .. seealso:: :obj:`pr.stats.fdr` correct for multiple testing .. rubric:: Examples >>> d = {"TP": [12, 0], "FP": [5, 12], "TN": [29, 10], "FN": [2, 2]} >>> df = pd.DataFrame(d) >>> df TP FP TN FN 0 12 5 29 2 1 0 12 10 2 >>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN) OR P PLeft PRight 0 0.165517 0.080269 0.044555 0.994525 1 0.000000 0.000067 0.000034 1.000000 .. py:function:: mcc(grs, genome=None, labels=None, strand=False, verbose=False) Compute Matthew's correlation coefficient for PyRanges overlaps. :param grs: PyRanges to compare. :type grs: list of PyRanges :param genome: Should contain chromosome sizes. By default, end position of the rightmost intervals are used as proxies for the chromosome size, but it is recommended to use a genome. :type genome: DataFrame or dict, default None :param labels: Names to give the PyRanges in the output. :type labels: list of str, default None :param strand: Whether to compute correlations per strand. :type strand: bool, default False :param verbose: Warn if some chromosomes are in the genome, but not in the PyRanges. :type verbose: bool, default False .. rubric:: Examples >>> grs = [pr.data.aorta(), pr.data.aorta(), pr.data.aorta2()] >>> mcc = pr.stats.mcc(grs, labels="abc", genome={"chr1": 2100000}) >>> mcc T F TP FP TN FN MCC 0 a a 728 0 2099272 0 1.00000 1 a b 728 0 2099272 0 1.00000 3 a c 457 485 2098787 271 0.55168 2 b a 728 0 2099272 0 1.00000 5 b b 728 0 2099272 0 1.00000 6 b c 457 485 2098787 271 0.55168 4 c a 457 271 2098787 485 0.55168 7 c b 457 271 2098787 485 0.55168 8 c c 942 0 2099058 0 1.00000 To create a symmetric matrix (useful for heatmaps of correlations): >>> mcc.set_index(["T", "F"]).MCC.unstack().rename_axis(None, axis=0) F a b c a 1.00000 1.00000 0.55168 b 1.00000 1.00000 0.55168 c 0.55168 0.55168 1.00000 .. py:function:: rowbased_spearman(x, y) Fast row-based Spearman's correlation. :param x: 2D numerical matrix. Same size as y. :type x: matrix-like :param y: 2D numerical matrix. Same size as x. :type y: matrix-like :returns: Array with same length as input, where values are P-values. :rtype: numpy.array .. seealso:: :obj:`pyranges.statistics.rowbased_pearson` fast row-based Pearson's correlation. :obj:`pr.stats.fdr` correct for multiple testing .. rubric:: Examples >>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]]) >>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]]) Perform Spearman's correlation pairwise on each row in 10x10 matrixes: >>> pr.stats.rowbased_spearman(x, y) array([-0.5, 0.5, -1. ]) .. py:function:: rowbased_pearson(x, y) Fast row-based Pearson's correlation. :param x: 2D numerical matrix. Same size as y. :type x: matrix-like :param y: 2D numerical matrix. Same size as x. :type y: matrix-like :returns: Array with same length as input, where values are P-values. :rtype: numpy.array .. seealso:: :obj:`pyranges.statistics.rowbased_spearman` fast row-based Spearman's correlation. :obj:`pr.stats.fdr` correct for multiple testing .. rubric:: Examples >>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]]) >>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]]) Perform Pearson's correlation pairwise on each row in 10x10 matrixes: >>> pr.stats.rowbased_pearson(x, y) array([-0.09078413, 0.65465367, -1. ]) .. py:function:: rowbased_rankdata(data) Rank order of entries in each row. Same as SciPy rankdata with method=mean. :param data: The data to find order of. :type data: matrix-like :returns: DataFrame where values are order of data. :rtype: Pandas.DataFrame .. rubric:: Examples >>> x = np.random.randint(10, size=(3, 4)) >>> x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]]) >>> pr.stats.rowbased_rankdata(x) 0 1 2 3 0 2.0 4.0 3.0 1.0 1 1.0 2.0 3.0 4.0 2 2.5 4.0 1.0 2.5 .. py:function:: simes(df, groupby, pcol, keep_position=False) Apply Simes method for giving dependent events a p-value. :param df: Data to analyse with Simes. :type df: pandas.DataFrame :param groupby: Features equal in these columns will be merged with Simes. :type groupby: str or list of str :param pcol: Name of column with p-values. :type pcol: str :param keep_position: Keep columns "Chromosome", "Start", "End" and "Strand" if they exist. :type keep_position: bool, default False .. seealso:: :obj:`pr.stats.fdr` correct for multiple testing .. rubric:: Examples >>> s = '''Chromosome Start End Strand Gene PValue ... 1 10 20 + P53 0.0001 ... 1 20 20 + P53 0.0002 ... 1 30 20 + P53 0.0003 ... 2 60 65 - FOX 0.05 ... 2 70 75 - FOX 0.0000001 ... 2 80 90 - FOX 0.0000021''' >>> gr = pr.from_string(s) >>> gr +--------------+-----------+-----------+--------------+------------+-------------+ | Chromosome | Start | End | Strand | Gene | PValue | | (category) | (int64) | (int64) | (category) | (object) | (float64) | |--------------+-----------+-----------+--------------+------------+-------------| | 1 | 10 | 20 | + | P53 | 0.0001 | | 1 | 20 | 20 | + | P53 | 0.0002 | | 1 | 30 | 20 | + | P53 | 0.0003 | | 2 | 60 | 65 | - | FOX | 0.05 | | 2 | 70 | 75 | - | FOX | 1e-07 | | 2 | 80 | 90 | - | FOX | 2.1e-06 | +--------------+-----------+-----------+--------------+------------+-------------+ Stranded PyRanges object has 6 rows and 6 columns from 2 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. >>> simes = pr.stats.simes(gr.df, "Gene", "PValue") >>> simes Gene Simes 0 FOX 3.000000e-07 1 P53 3.000000e-04 >>> gr.apply(lambda df: ... pr.stats.simes(df, "Gene", "PValue", keep_position=True)) +--------------+-----------+-----------+-------------+--------------+------------+ | Chromosome | Start | End | Simes | Strand | Gene | | (category) | (int64) | (int64) | (float64) | (category) | (object) | |--------------+-----------+-----------+-------------+--------------+------------| | 1 | 10 | 20 | 0.0001 | + | P53 | | 2 | 60 | 90 | 1e-07 | - | FOX | +--------------+-----------+-----------+-------------+--------------+------------+ Stranded PyRanges object has 2 rows and 6 columns from 2 chromosomes. For printing, the PyRanges was sorted on Chromosome and Strand. .. py:class:: StatisticsMethods(pr) Namespace for statistical comparsion-operations. Accessed with gr.stats. .. py:attribute:: pr .. py:method:: forbes(other, chromsizes, strandedness=None) Compute Forbes coefficient. Ratio which represents observed versus expected co-occurence. Described in ``Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.`` :param other: Intervals to compare with. :type other: PyRanges :param chromsizes: Integer representing genome length or mapping from chromosomes to its length. :type chromsizes: int, dict, DataFrame or PyRanges :param strandedness: Whether to compute without regards to strand or on same or opposite. :type strandedness: {None, "same", "opposite", False}, default None, i.e. "auto" :returns: Ratio of observed versus expected co-occurence. :rtype: float .. seealso:: :obj:`pyranges.statistics.jaccard` compute the jaccard coefficient .. rubric:: Examples >>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background() >>> chromsizes = pr.data.chromsizes() >>> gr.stats.forbes(gr2, chromsizes=chromsizes) 1.7168314674978278 .. py:method:: jaccard(other, **kwargs) Compute Jaccards coefficient. Ratio of the intersection and union of two sets. :param other: Intervals to compare with. :type other: PyRanges :param chromsizes: Integer representing genome length or mapping from chromosomes to its length. :type chromsizes: int, dict, DataFrame or PyRanges :param strandedness: Whether to compute without regards to strand or on same or opposite. :type strandedness: {None, "same", "opposite", False}, default None, i.e. "auto" :returns: Ratio of the intersection and union of two sets. :rtype: float .. seealso:: :obj:`pyranges.statistics.forbes` compute the forbes coefficient .. rubric:: Examples >>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background() >>> chromsizes = pr.data.chromsizes() >>> gr.stats.jaccard(gr2, chromsizes=chromsizes) 6.657941988519211e-05 .. py:method:: relative_distance(other, **kwargs) Compute spatial correllation between two sets. Metric which describes relative distance between each interval in one set and two closest intervals in another. :param other: Intervals to compare with. :type other: PyRanges :param chromsizes: Integer representing genome length or mapping from chromosomes to its length. :type chromsizes: int, dict, DataFrame or PyRanges :param strandedness: Whether to compute without regards to strand or on same or opposite. :type strandedness: {None, "same", "opposite", False}, default None, i.e. "auto" :returns: DataFrame containing the frequency of each relative distance. :rtype: pandas.DataFrame .. seealso:: :obj:`pyranges.statistics.jaccard` compute the jaccard coefficient :obj:`pyranges.statistics.forbes` compute the forbes coefficient .. rubric:: Examples >>> gr, gr2 = pr.data.chipseq(), pr.data.chipseq_background() >>> chromsizes = pr.data.chromsizes() >>> gr.stats.relative_distance(gr2) reldist count total fraction 0 0.00 264 9956 0.026517 1 0.01 226 9956 0.022700 2 0.02 206 9956 0.020691 3 0.03 235 9956 0.023604 4 0.04 194 9956 0.019486 5 0.05 241 9956 0.024207 6 0.06 201 9956 0.020189 7 0.07 191 9956 0.019184 8 0.08 192 9956 0.019285 9 0.09 191 9956 0.019184 10 0.10 186 9956 0.018682 11 0.11 203 9956 0.020390 12 0.12 218 9956 0.021896 13 0.13 209 9956 0.020992 14 0.14 201 9956 0.020189 15 0.15 178 9956 0.017879 16 0.16 202 9956 0.020289 17 0.17 197 9956 0.019787 18 0.18 208 9956 0.020892 19 0.19 202 9956 0.020289 20 0.20 191 9956 0.019184 21 0.21 188 9956 0.018883 22 0.22 213 9956 0.021394 23 0.23 192 9956 0.019285 24 0.24 199 9956 0.019988 25 0.25 181 9956 0.018180 26 0.26 172 9956 0.017276 27 0.27 191 9956 0.019184 28 0.28 190 9956 0.019084 29 0.29 192 9956 0.019285 30 0.30 201 9956 0.020189 31 0.31 212 9956 0.021294 32 0.32 213 9956 0.021394 33 0.33 177 9956 0.017778 34 0.34 197 9956 0.019787 35 0.35 163 9956 0.016372 36 0.36 191 9956 0.019184 37 0.37 198 9956 0.019888 38 0.38 160 9956 0.016071 39 0.39 188 9956 0.018883 40 0.40 200 9956 0.020088 41 0.41 188 9956 0.018883 42 0.42 230 9956 0.023102 43 0.43 197 9956 0.019787 44 0.44 224 9956 0.022499 45 0.45 184 9956 0.018481 46 0.46 198 9956 0.019888 47 0.47 187 9956 0.018783 48 0.48 200 9956 0.020088 49 0.49 194 9956 0.019486