Inspecting PyRanges

String representation

Print a PyRanges object for an overview of its data:

>>> import pyranges1 as pr
>>> gr = pr.example_data.chipseq
>>> print(gr)
index    |    Chromosome    Start      End        Name    Score    Strand
int64    |    category      int64      int64      str     int64    category
-------  ---  ------------  ---------  ---------  ------  -------  ----------
0        |    chr8          28510032   28510057   U0      0        -
1        |    chr7          107153363  107153388  U0      0        -
2        |    chr5          135821802  135821827  U0      0        -
3        |    chr14         19418999   19419024   U0      0        -
...      |    ...           ...        ...        ...     ...      ...
16       |    chr9          120803448  120803473  U0      0        +
17       |    chr6          89296757   89296782   U0      0        -
18       |    chr1          194245558  194245583  U0      0        +
19       |    chr8          57916061   57916086   U0      0        +
PyRanges with 20 rows, 6 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.

To obtain this representation, you can invoke the str builtin, e.g. with str(gr).

>>> a = str(gr)
>>> print(a)
index    |    Chromosome    Start      End        Name    Score    Strand
int64    |    category      int64      int64      str     int64    category
-------  ---  ------------  ---------  ---------  ------  -------  ----------
0        |    chr8          28510032   28510057   U0      0        -
1        |    chr7          107153363  107153388  U0      0        -
2        |    chr5          135821802  135821827  U0      0        -
3        |    chr14         19418999   19419024   U0      0        -
...      |    ...           ...        ...        ...     ...      ...
16       |    chr9          120803448  120803473  U0      0        +
17       |    chr6          89296757   89296782   U0      0        -
18       |    chr1          194245558  194245583  U0      0        +
19       |    chr8          57916061   57916086   U0      0        +
PyRanges with 20 rows, 6 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.

Only a limited number of rows are displayed, which are taken from the top and bottom of the table. You can change the number of rows displayed in any PyRanges using pyranges1.options.set_options() as such:

>>> pr.options.set_option('max_rows_to_show', 20)
>>> gr
  index  |    Chromosome        Start        End  Name      Score  Strand
  int64  |    category          int64      int64  str       int64  category
-------  ---  ------------  ---------  ---------  ------  -------  ----------
      0  |    chr8           28510032   28510057  U0            0  -
      1  |    chr7          107153363  107153388  U0            0  -
      2  |    chr5          135821802  135821827  U0            0  -
      3  |    chr14          19418999   19419024  U0            0  -
      4  |    chr12         106679761  106679786  U0            0  -
      5  |    chr21          40099618   40099643  U0            0  +
      6  |    chr8           22714402   22714427  U0            0  -
      7  |    chr19          19571102   19571127  U0            0  +
      8  |    chr3          140986358  140986383  U0            0  -
      9  |    chr10          35419784   35419809  U0            0  -
     10  |    chr4           98488749   98488774  U0            0  +
     11  |    chr11          22225193   22225218  U0            0  +
     12  |    chr1           38457520   38457545  U0            0  +
     13  |    chr1           80668132   80668157  U0            0  -
     14  |    chr2          152562484  152562509  U0            0  -
     15  |    chr4          153155301  153155326  U0            0  +
     16  |    chr9          120803448  120803473  U0            0  +
     17  |    chr6           89296757   89296782  U0            0  -
     18  |    chr1          194245558  194245583  U0            0  +
     19  |    chr8           57916061   57916086  U0            0  +
PyRanges with 20 rows, 6 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.

Let’s reset display options to defaults:

>>> pr.options.reset_options()

Detecting invalid PyRanges

The string representation of PyRanges shows useful information to detect data anomalies.

For example, intervals may have invalid lengths. Note that message at the bottom:

>>> pr.PyRanges(dict(Chromosome='chr1', Start=[1, 10], End=[0, 20]))
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1        0
      1  |    chr1               10       20
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
Invalid ranges:
  * 1 intervals are empty or negative length (end <= start). See indexes: 0

Intervals may also be invalid because of NaN in their Start or End values:

>>> pr.PyRanges(dict(Chromosome='chr1', Start=[None, 10], End=[0, 20]))
  index  |    Chromosome        Start      End
  int64  |    str             float64    int64
-------  ---  ------------  ---------  -------
      0  |    chr1                nan        0
      1  |    chr1                 10       20
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
Invalid ranges:
  * 1 starts or ends are nan. See indexes: 0

Or because they have negative Start/End values, see below. This can be remedied with function clip_ranges.

>>> pr.PyRanges(dict(Chromosome='chr1', Start=[1, -10], End=[11, 20]))
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1       11
      1  |    chr1              -10       20
PyRanges with 2 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.
Invalid ranges:
  * 1 starts or ends are < 0. See indexes: 1

A relatively common case is PyRanges objects that have a Strand column, but the strands are not valid genomic strands. Note the warning in the last line of the string representation:

>>> g = pr.PyRanges(dict(Chromosome='chr1', Start=[1, 1], End=[11, 20], Strand=['-', '#']))
>>> g
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1       11  -
      1  |    chr1                1       20  #
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands (including non-genomic strands: #).

Non-valid strands can affect the functioning of many methods that have a use_strand parameter (e.g. slice_ranges) or a strand_behavior parameter (e.g. overlap), because these parameters by default are set to auto, meaning that strand is considered only if it is valid. Indeed, see that this subregion is calculated from the left limit, even for the interval on the ‘-’ strand:

>>> g.slice_ranges(0, 3)
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1        4  -
      1  |    chr1                1        4  #
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands (including non-genomic strands: #).

When running the code above, you should get a warning message like this:

UserWarning: slice_ranges: 'auto' use_strand treated as False due to invalid Strand values. Suppress this warning with use_strand=False
g.slice_ranges(0, 3)

You can check whether a PyRanges object has valid Strand information with property strand_valid:

>>> g.strand_valid
False

To fix the invalid strands by turning them to ‘+’, use method make_strand_valid:

>>> g2 = g.make_strand_valid()
>>> g2
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    chr1                1       11  -
      1  |    chr1                1       20  +
PyRanges with 2 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.

Lastly, some operations may result in PyRanges with duplicated indices, which is shown in the penultimate line of the string representation:

>>> gr1= pr.PyRanges(dict(Chromosome='chr1', Start=[1], End=[100]))
>>> gr2 = pr.PyRanges(dict(Chromosome='chr1', Start=[20, 50], End=[30, 60]))
>>> gr3 = gr1.subtract_overlaps(gr2)
>>> gr3
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1       20
      0  |    chr1               30       50
      0  |    chr1               60      100
PyRanges with 3 rows, 3 columns, and 1 index columns (with 2 index duplicates).
Contains 1 chromosomes.

To remedy this, use pandas method reset_index:

>>> gr3 = gr3.reset_index(drop=True)
>>> gr3
  index  |    Chromosome      Start      End
  int64  |    str             int64    int64
-------  ---  ------------  -------  -------
      0  |    chr1                1       20
      1  |    chr1               30       50
      2  |    chr1               60      100
PyRanges with 3 rows, 3 columns, and 1 index columns.
Contains 1 chromosomes.

Column summary statistics

PyRanges columns are pandas Series, and they may be of different data types. The types are shown in the header shown in their string representation (see above). To see them all, use property dtypes like you do for dataframes:

>>> gr.dtypes
Chromosome    category
Start            int64
End              int64
Name               str
Score            int64
Strand        category
dtype: object

There are convenient methods inherited from pandas dataframes to inspect PyRanges objects, such as info:

>>> gr.info() 
<class 'pyranges1.core.pyranges_main.PyRanges'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Chromosome  20 non-null     category
 1   Start       20 non-null     int64
 2   End         20 non-null     int64
 3   Name        20 non-null     str
 4   Score       20 non-null     int64
 5   Strand      20 non-null     category
dtypes: category(2), int64(3), str(1)
memory usage: ... KB

On the other hand, describe reports aggregate metrics of numerical columns:

>>> gr.describe()
              Start           End  Score
count  2.000000e+01  2.000000e+01   20.0
mean   8.320972e+07  8.320975e+07    0.0
std    5.439939e+07  5.439939e+07    0.0
min    1.941900e+07  1.941902e+07    0.0
25%    3.369235e+07  3.369237e+07    0.0
50%    8.498244e+07  8.498247e+07    0.0
75%    1.245580e+08  1.245581e+08    0.0
max    1.942456e+08  1.942456e+08    0.0