How-to pages

Loading/Creating PyRanges

A PyRanges object can be built in four ways:

  1. from a Pandas dataframe

  2. using the PyRanges constructor with the chromosomes, starts and ends (and optionally strands), individually.

  3. using one of the custom reader functions for genomic data (read_bed, read_bam or read_gtf, read_gff3)

  4. from a dictionary (see below for the required structure)

Using a DataFrame

If you instantiate a PyRanges object from a dataframe, it should at least contain the columns Chromosome, Start and End. Coordinates follow the python standard (0-based, start included, end excluded). A column called Strand is optional. Any other columns in the dataframe are carried over as metadata.

>>> import pandas as pd, pyranges as pr
>>> df=pd.DataFrame(
... {'Chromosome':['chr1', 'chr1', 'chr1', 'chr3'],
...  'Start': [5, 20, 80, 10],
...  'End':   [10, 28, 95, 38],
...  'Strand':['+', '+', '-', '+'],
...  'title': ['a', 'b', 'c', 'd']}
... )
>>> df
  Chromosome  Start  End Strand title
0       chr1      5   10      +     a
1       chr1     20   28      +     b
2       chr1     80   95      -     c
3       chr3     10   38      +     d

To instantiate PyRanges from a dataframe, provide it as argument to the PyRanges constructor:

>>> p=pr.PyRanges(df)
>>> p
+--------------+-----------+-----------+--------------+------------+
| Chromosome   |     Start |       End | Strand       | title      |
| (category)   |   (int64) |   (int64) | (category)   | (object)   |
|--------------+-----------+-----------+--------------+------------|
| chr1         |         5 |        10 | +            | a          |
| chr1         |        20 |        28 | +            | b          |
| chr1         |        80 |        95 | -            | c          |
| chr3         |        10 |        38 | +            | d          |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 4 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

A DataFrame is most typically derived from a file. Here we load an example bed file included in the pyranges package as DataFrame, then instantiate a PyRanges object from it.

>>> chipseq = pr.get_example_path("chipseq.bed")
>>> df = pd.read_csv(chipseq, header=None, names="Chromosome Start End Name Score Strand".split(), sep="\t")
>>> print(df.head(2))
  Chromosome      Start        End Name  Score Strand
0       chr8   28510032   28510057   U0      0      -
1       chr7  107153363  107153388   U0      0      -
>>> print(df.tail(2))
     Chromosome      Start        End Name  Score Strand
9998       chr1  194245558  194245583   U0      0      +
9999       chr8   57916061   57916086   U0      0      +
>>> print(pr.PyRanges(df))
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Using constructor keywords

The other way to instantiate a PyRanges object is to use the constructor with keywords:

>>> gr = pr.PyRanges(chromosomes=df.Chromosome, starts=df.Start, ends=df.End)
>>> print(gr)
+--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int64)   | (int64)   |
|--------------+-----------+-----------|
| chr1         | 100079649 | 100079674 |
| chr1         | 212609534 | 212609559 |
| chr1         | 223587418 | 223587443 |
| chr1         | 202450161 | 202450186 |
| ...          | ...       | ...       |
| chrY         | 11942770  | 11942795  |
| chrY         | 8316773   | 8316798   |
| chrY         | 7463444   | 7463469   |
| chrY         | 7405376   | 7405401   |
+--------------+-----------+-----------+
Unstranded PyRanges object has 10,000 rows and 3 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

Each column may be provided as Pandas Series, as above, or as basic Python datatypes:

>>> gr = pr.PyRanges(chromosomes="chr1", strands="+", starts=[0, 1, 2], ends=(3, 4, 5))
>>> print(gr)
+--------------+-----------+-----------+--------------+
| Chromosome   |     Start |       End | Strand       |
| (category)   |   (int64) |   (int64) | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         |         0 |         3 | +            |
| chr1         |         1 |         4 | +            |
| chr1         |         2 |         5 | +            |
+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 3 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr = pr.PyRanges(chromosomes="chr1 chr2 chr3".split(), strands="+ - +".split(), starts=[0, 1, 2], ends=(3, 4, 5))
>>> print(gr)
+--------------+-----------+-----------+--------------+
| Chromosome   |     Start |       End | Strand       |
| (category)   |   (int64) |   (int64) | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         |         0 |         3 | +            |
| chr2         |         1 |         4 | -            |
| chr3         |         2 |         5 | +            |
+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 3 rows and 4 columns from 3 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Using read_bed, read_gtf, read_gff3 or read_bam

The pyranges library can create PyRanges from gff3 common file formats, namely gtf/gff, gff3, bed and bam. Note that, these files encoded interval coordinates as 1-based, start included, end included; when instancing a PyRanges object they are converted to the python convention.

>>> ensembl_path = pr.get_example_path("ensembl.gtf")
>>> gr = pr.read_gtf(ensembl_path)
>>> print(gr)
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | gene_name   | gene_source    | +14   |
| (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)        | (object)       | (object)    | (object)       | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+-------|
| 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...         | ...            | ...   |
| 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
| 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
| 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
| 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+-------+
Stranded PyRanges object has 95 rows and 26 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
14 hidden columns: gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, exon_number, exon_id, exon_version, ... (+ 3 more.)

To read bam files the optional bamread-library must be installed. Use:

conda install -c bioconda bamread

or:

pip install bamread

to install it.

read_bam takes the arguments sparse, mapq, required_flag, filter_flag, which have the default values True, 0, 0 and 1540, respectively. With sparse True, only the columns ['Chromosome', 'Start', 'End', 'Strand', 'Flag'] are fetched. Setting sparse to False additionally gives you the columns ['QueryStart', 'QueryEnd', 'Name', 'Cigar', 'Quality'], but is more time and memory-consuming. All the reader functions also take the flag as_df to return a DataFrame instead of a PyRanges object.

Using from_dict

With this method, an input dictionary with the structure shown below must be provided:

>>> f1 = pr.data.f1()
>>> d = f1.to_example(n=10)
>>> print(d)
{'Chromosome': ['chr1', 'chr1', 'chr1'], 'Start': [3, 8, 5], 'End': [6, 9, 7], 'Name': ['interval1', 'interval3', 'interval2'], 'Score': [0, 0, 0], 'Strand': ['+', '+', '-']}
>>> print(pr.from_dict(d))
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |         3 |         6 | interval1  |         0 | +            |
| chr1         |         8 |         9 | interval3  |         0 | +            |
| chr1         |         5 |         7 | interval2  |         0 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Writing to disk

The PyRanges can be written to several formats, namely csv, gtf, gff3 and bigwig. If no path-argument is given, the string representation of the data is returned. (It may potentially be very large.) If a path is given, it is taken as the path to the file to be written; in this case, the return value is the object itself, to allow inserting write methods into method call chains.

Writing in tabular formats

Tabular formats such as csv, gtf, gff3 are the most popular for genomic annotations. You can readily write them using the correspondent methods. N

>>> import pyranges as pr
>>> gr = pr.data.chipseq()
>>> gr.to_gtf("chipseq.gtf")
>>> #file chipseq.gtf has been created
>>> print(gr.head())
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         |         0 | +            |
| chr1         | 169887529 | 169887554 | U0         |         0 | +            |
| chr1         | 216711011 | 216711036 | U0         |         0 | +            |
| chr1         | 144227079 | 144227104 | U0         |         0 | +            |
| chr1         | 148177825 | 148177850 | U0         |         0 | +            |
| chr1         | 113486652 | 113486677 | U0         |         0 | +            |
| chr1         |  27024083 |  27024108 | U0         |         0 | +            |
| chr1         |  37865066 |  37865091 | U0         |         0 | +            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 8 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> print(gr.head().to_gtf()) 
chr1  .       .       212609535       212609559       0       +       .       Name "U0";
chr1  .       .       169887530       169887554       0       +       .       Name "U0";
chr1  .       .       216711012       216711036       0       +       .       Name "U0";
chr1  .       .       144227080       144227104       0       +       .       Name "U0";
chr1  .       .       148177826       148177850       0       +       .       Name "U0";
chr1  .       .       113486653       113486677       0       +       .       Name "U0";
chr1  .       .       27024084        27024108        0       +       .       Name "U0";
chr1  .       .       37865067        37865091        0       +       .       Name "U0";
>>> print(gr.head().to_gff3()) 
chr1  .       .       212609535       212609559       0       +       .       Name=U0
chr1  .       .       169887530       169887554       0       +       .       Name=U0
chr1  .       .       216711012       216711036       0       +       .       Name=U0
chr1  .       .       144227080       144227104       0       +       .       Name=U0
chr1  .       .       148177826       148177850       0       +       .       Name=U0
chr1  .       .       113486653       113486677       0       +       .       Name=U0
chr1  .       .       27024084        27024108        0       +       .       Name=U0
chr1  .       .       37865067        37865091        0       +       .       Name=U0

Methods to_gff3 and to_gtf have a default mapping of PyRanges columns to GFF/GTF fields. All extra (“metadata”) columns are put in the last field:

>>> gr.Label='something'
>>> print(gr.head().to_gtf()) 
chr1  .       .       212609535       212609559       0       +       .       Name "U0"; Label "something";
chr1  .       .       169887530       169887554       0       +       .       Name "U0"; Label "something";
chr1  .       .       216711012       216711036       0       +       .       Name "U0"; Label "something";
chr1  .       .       144227080       144227104       0       +       .       Name "U0"; Label "something";
chr1  .       .       148177826       148177850       0       +       .       Name "U0"; Label "something";
chr1  .       .       113486653       113486677       0       +       .       Name "U0"; Label "something";
chr1  .       .       27024084        27024108        0       +       .       Name "U0"; Label "something";
chr1  .       .       37865067        37865091        0       +       .       Name "U0"; Label "something";

Such mapping, as well as which attribute(s) are included as last field, can be altered:

>>> gr.Tag='sometext'
>>> print(gr.head().to_gtf(map_cols={"feature":"Name", "attribute":"Tag"})) 
chr1  .       U0      212609535       212609559       0       +       .       Tag "sometext";
chr1  .       U0      169887530       169887554       0       +       .       Tag "sometext";
chr1  .       U0      216711012       216711036       0       +       .       Tag "sometext";
chr1  .       U0      144227080       144227104       0       +       .       Tag "sometext";
chr1  .       U0      148177826       148177850       0       +       .       Tag "sometext";
chr1  .       U0      113486653       113486677       0       +       .       Tag "sometext";
chr1  .       U0      27024084        27024108        0       +       .       Tag "sometext";
chr1  .       U0      37865067        37865091        0       +       .       Tag "sometext";

Note that the gtf and gff3 formats are 1-based with both Start and End included. Instead, csv uses the python/pyranges notation:

>>> print(gr.head().to_csv())
Chromosome,Start,End,Name,Score,Strand,Label,Tag
chr1,212609534,212609559,U0,0,+,something,sometext
chr1,169887529,169887554,U0,0,+,something,sometext
chr1,216711011,216711036,U0,0,+,something,sometext
chr1,144227079,144227104,U0,0,+,something,sometext
chr1,148177825,148177850,U0,0,+,something,sometext
chr1,113486652,113486677,U0,0,+,something,sometext
chr1,27024083,27024108,U0,0,+,something,sometext
chr1,37865066,37865091,U0,0,+,something,sometext

The to_csv method takes the arguments header and sep:

>>> print(gr.drop(['Label', 'Tag']).head().to_csv(sep="\t", header=False)) 
chr1  212609534       212609559       U0      0       +
chr1  169887529       169887554       U0      0       +
chr1  216711011       216711036       U0      0       +
chr1  144227079       144227104       U0      0       +
chr1  148177825       148177850       U0      0       +
chr1  113486652       113486677       U0      0       +
chr1  27024083        27024108        U0      0       +
chr1  37865066        37865091        U0      0       +

The bigwig format differs substantially from the formats above. Bigwig is a binary format, and it is typically used for large continuous quantitative data along a genome sequence.

The pyranges library can also create bigwigs, but it needs the library pybigwig which is not installed by default.

Use:

conda install -c bioconda pybigwig

or:

pip install pybigwig

to install it.

The bigwig writer needs to know the chromosome sizes, provided as a dictionary {chromosome_name: size} or an analogous PyRanges with sizes as End (with Start values set to zero).

You can derive chromosome sizes from a fasta file using pyfaidx. Install with:

conda install -c bioconda pyfaidx

or

pip install pyfaidx

Once you obtained chromosome sizes, you are ready to write your PyRanges object to a bigwig file:

>>> gr.to_bigwig("chipseq.bw", chromsizes) 
>>> # file chipseq.bw has been created

Bigwig is typically used to represent a coverage of some type. To compute it from an arbitrary value column, use the value_col argument. See the API for additional options. If you want to write one bigwig for each strand, you need to do it manually.

>>> gr["+"].to_bigwig("chipseq_plus.bw", chromsizes) 
>>> gr["-"].to_bigwig("chipseq_minus.bw", chromsizes) 

Inspecting PyRanges

The PyRanges method print provides an overview of its data:

>>> import pyranges as pr
>>> gr = pr.data.chipseq()
>>> gr.print()
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The same method is invoked under the hood anytime we request a string representation:

>>> print(str(gr))
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

As explained in the tutorial, PyRanges objects consist of collections of DataFrames, organized per chromosome (and strand, if Stranded). When printed, they are displayed as a continuous table, ordered by Chromosome (and strand).

The window width affects the output of print: columns that do not fit are hidden. When this happens, a message is printed after the table:

>>> gr.new_col = 'value'
>>> gr.another_col = 99
>>> gr.print()
+--------------+-----------+-----------+------------+-----------+--------------+------------+---------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       | new_col    | another_col   |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   | (object)   | (int64)       |
|--------------+-----------+-----------+------------+-----------+--------------+------------+---------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            | value      | 99            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            | value      | 99            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            | value      | 99            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            | value      | 99            |
| ...          | ...       | ...       | ...        | ...       | ...          | ...        | ...           |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            | value      | 99            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            | value      | 99            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            | value      | 99            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            | value      | 99            |
+--------------+-----------+-----------+------------+-----------+--------------+------------+---------------+
Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Only a limited number of rows are displayed, which are taken from the top and bottom of the table. This is 8 by default, and can be redefined through the first argument of print, named n:

>>> gr.print(2)
+--------------+-----------+-----------+------------+-----------+--------------+------------+---------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       | new_col    | another_col   |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   | (object)   | (int64)       |
|--------------+-----------+-----------+------------+-----------+--------------+------------+---------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            | value      | 99            |
| ...          | ...       | ...       | ...        | ...       | ...          | ...        | ...           |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            | value      | 99            |
+--------------+-----------+-----------+------------+-----------+--------------+------------+---------------+
Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr.print(n=20)
+--------------+-----------+-----------+------------+-----------+--------------+------------+---------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       | new_col    | another_col   |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   | (object)   | (int64)       |
|--------------+-----------+-----------+------------+-----------+--------------+------------+---------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            | value      | 99            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            | value      | 99            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            | value      | 99            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            | value      | 99            |
| chr1         | 148177825 | 148177850 | U0         | 0         | +            | value      | 99            |
| chr1         | 113486652 | 113486677 | U0         | 0         | +            | value      | 99            |
| chr1         | 27024083  | 27024108  | U0         | 0         | +            | value      | 99            |
| chr1         | 37865066  | 37865091  | U0         | 0         | +            | value      | 99            |
| chr1         | 47488200  | 47488225  | U0         | 0         | +            | value      | 99            |
| chr1         | 197075093 | 197075118 | U0         | 0         | +            | value      | 99            |
| ...          | ...       | ...       | ...        | ...       | ...          | ...        | ...           |
| chrY         | 21707662  | 21707687  | U0         | 0         | -            | value      | 99            |
| chrY         | 7761026   | 7761051   | U0         | 0         | -            | value      | 99            |
| chrY         | 22210637  | 22210662  | U0         | 0         | -            | value      | 99            |
| chrY         | 14774053  | 14774078  | U0         | 0         | -            | value      | 99            |
| chrY         | 16495497  | 16495522  | U0         | 0         | -            | value      | 99            |
| chrY         | 7046809   | 7046834   | U0         | 0         | -            | value      | 99            |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            | value      | 99            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            | value      | 99            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            | value      | 99            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            | value      | 99            |
+--------------+-----------+-----------+------------+-----------+--------------+------------+---------------+
Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Argument formatting allows to fine-tune appearance. It takes a dictionary with any column name as key, and a string as value which follows the python format syntax:

>>> gr.print(formatting={
...     'Score':'{:.2f}',
...       'End':'{:e}',
...       'Start':'{:,}',
...       'Name':'name={}',
...       })
+--------------+-------------+--------------+------------+-----------+--------------+------------+---------------+
| Chromosome   | Start       | End          | Name       | Score     | Strand       | new_col    | another_col   |
| (category)   | (int64)     | (int64)      | (object)   | (int64)   | (category)   | (object)   | (int64)       |
|--------------+-------------+--------------+------------+-----------+--------------+------------+---------------|
| chr1         | 212,609,534 | 2.126096e+08 | name=U0    | 0.00      | +            | value      | 99            |
| chr1         | 169,887,529 | 1.698876e+08 | name=U0    | 0.00      | +            | value      | 99            |
| chr1         | 216,711,011 | 2.167110e+08 | name=U0    | 0.00      | +            | value      | 99            |
| chr1         | 144,227,079 | 1.442271e+08 | name=U0    | 0.00      | +            | value      | 99            |
| ...          | ...         | ...          | ...        | ...       | ...          | ...        | ...           |
| chrY         | 15,224,235  | 1.522426e+07 | name=U0    | 0.00      | -            | value      | 99            |
| chrY         | 13,517,892  | 1.351792e+07 | name=U0    | 0.00      | -            | value      | 99            |
| chrY         | 8,010,951   | 8.010976e+06 | name=U0    | 0.00      | -            | value      | 99            |
| chrY         | 7,405,376   | 7.405401e+06 | name=U0    | 0.00      | -            | value      | 99            |
+--------------+-------------+--------------+------------+-----------+--------------+------------+---------------+
Stranded PyRanges object has 10,000 rows and 8 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

PyRanges columns are pandas Series, and they may be of different data types. The types are shown in the header shown with print (see above). To see them all, use property dtypes:

>>> gr.dtypes
Chromosome     category
Start             int64
End               int64
Name             object
Score             int64
Strand         category
new_col          object
another_col       int64
dtype: object

If you want to inspect more information from a PyRanges object, remember that you can always transform it into a pandas DataFrame, which gives access to all its methods. For example, you may employ pandas info and describe:

>>> gr.df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Chromosome   10000 non-null  category
 1   Start        10000 non-null  int64
 2   End          10000 non-null  int64
 3   Name         10000 non-null  object
 4   Score        10000 non-null  int64
 5   Strand       10000 non-null  category
 6   new_col      10000 non-null  object
 7   another_col  10000 non-null  int64
dtypes: category(2), int64(2), int64(2), object(2)
memory usage: 489.3+ KB
>>> gr.df.describe()
              Start           End    Score  another_col
count  1.000000e+04  1.000000e+04  10000.0      10000.0
mean   8.087570e+07  8.087573e+07      0.0         99.0
std    5.572825e+07  5.572825e+07      0.0          0.0
min    1.361100e+04  1.363600e+04      0.0         99.0
25%    3.550257e+07  3.550260e+07      0.0         99.0
50%    7.030672e+07  7.030674e+07      0.0         99.0
75%    1.167902e+08  1.167902e+08      0.0         99.0
max    2.471349e+08  2.471349e+08      0.0         99.0

Accessing data

Selecting rows

As seen in the tutorial, PyRanges provides various ways to select a subset of rows. All of these methods return a (smaller) copy of the original object.

One way is to index by genomic region, which may take any of the following syntaxes:

  • chromosome

  • chromosome, position slice

  • chromosome, strand, position slice

Here’s one example for each:

>>> import pyranges as pr
>>> gr = pr.data.chipseq()
>>> gr['chrX']
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chrX         | 13843759  | 13843784  | U0         | 0         | +            |
| chrX         | 114673546 | 114673571 | U0         | 0         | +            |
| chrX         | 131816774 | 131816799 | U0         | 0         | +            |
| chrX         | 45504745  | 45504770  | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrX         | 146694149 | 146694174 | U0         | 0         | -            |
| chrX         | 5044527   | 5044552   | U0         | 0         | -            |
| chrX         | 15281263  | 15281288  | U0         | 0         | -            |
| chrX         | 120273723 | 120273748 | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 282 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

>>> gr['chr1', 1000000:3000000]
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |   1541598 |   1541623 | U0         |         0 | +            |
| chr1         |   1599121 |   1599146 | U0         |         0 | +            |
| chr1         |   1325303 |   1325328 | U0         |         0 | -            |
| chr1         |   1820285 |   1820310 | U0         |         0 | -            |
| chr1         |   2448322 |   2448347 | U0         |         0 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 5 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

>>> gr['chr1', '-', 1000000:3000000]
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |   1325303 |   1325328 | U0         |         0 | -            |
| chr1         |   1820285 |   1820310 | U0         |         0 | -            |
| chr1         |   2448322 |   2448347 | U0         |         0 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Simple forms of row selection are done through methods head and tail, which return the top or bottom N rows, respectively, where N is 8 by default:

>>> gr.head()
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         |         0 | +            |
| chr1         | 169887529 | 169887554 | U0         |         0 | +            |
| chr1         | 216711011 | 216711036 | U0         |         0 | +            |
| chr1         | 144227079 | 144227104 | U0         |         0 | +            |
| chr1         | 148177825 | 148177850 | U0         |         0 | +            |
| chr1         | 113486652 | 113486677 | U0         |         0 | +            |
| chr1         |  27024083 |  27024108 | U0         |         0 | +            |
| chr1         |  37865066 |  37865091 | U0         |         0 | +            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 8 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr.tail()
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chrY         |  22210637 |  22210662 | U0         |         0 | -            |
| chrY         |  14774053 |  14774078 | U0         |         0 | -            |
| chrY         |  16495497 |  16495522 | U0         |         0 | -            |
| chrY         |   7046809 |   7046834 | U0         |         0 | -            |
| chrY         |  15224235 |  15224260 | U0         |         0 | -            |
| chrY         |  13517892 |  13517917 | U0         |         0 | -            |
| chrY         |   8010951 |   8010976 | U0         |         0 | -            |
| chrY         |   7405376 |   7405401 | U0         |         0 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 8 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The most important form of row selection is by indexing with a boolean Series. This is typically generated from a column through a comparison operator. Let’s see it with some other example data:

>>> gg = pr.data.aorta()
>>> gg.print(n=20)
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |      9939 |     10138 | H3K27me3   |         7 | +            |
| chr1         |      9953 |     10152 | H3K27me3   |         5 | +            |
| chr1         |     10024 |     10223 | H3K27me3   |         1 | +            |
| chr1         |     10246 |     10445 | H3K27me3   |         4 | +            |
| chr1         |    110246 |    110445 | H3K27me3   |         1 | +            |
| chr1         |      9916 |     10115 | H3K27me3   |         5 | -            |
| chr1         |      9951 |     10150 | H3K27me3   |         8 | -            |
| chr1         |      9978 |     10177 | H3K27me3   |         7 | -            |
| chr1         |     10001 |     10200 | H3K27me3   |         5 | -            |
| chr1         |     10127 |     10326 | H3K27me3   |         1 | -            |
| chr1         |     10241 |     10440 | H3K27me3   |         6 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 11 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Below, we produce a boolean Series:

>>> gg.Score > 5
1      True
3     False
6     False
9     False
10    False
0     False
2      True
4      True
5     False
7     False
8      True
Name: Score, dtype: bool

And we use it to select the rows in which the column Score has a value greater than 5:

>>> gg[gg.Score>5]
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |      9939 |     10138 | H3K27me3   |         7 | +            |
| chr1         |      9951 |     10150 | H3K27me3   |         8 | -            |
| chr1         |      9978 |     10177 | H3K27me3   |         7 | -            |
| chr1         |     10241 |     10440 | H3K27me3   |         6 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 4 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

As pandas users know, these logical operators can be employed with boolean Series:

  • “&” = element-wise AND operator

  • “|” = element-wise OR operator

  • “~” = NOT operator, inverts the values of the Series on its right

When using logical operators, make sure to parenthesize properly.

Let’s get the + intervals with Score 1 starting before 12,000 or ending after 100,000:

>>> gg[
...    (gg.Score==1) &
...    (gg.Strand=='+') &
...    ((gg.Start<12000) | (gg.End>100000))
...    ]
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |     10024 |     10223 | H3K27me3   |         1 | +            |
| chr1         |    110246 |    110445 | H3K27me3   |         1 | +            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 2 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Let’s invert the selection, i.e. taking all intervals that do not fit the above criteria:

>>> gg[~(
...      (gg.Score==1) &
...      (gg.Strand=='+') &
...      ((gg.Start<12000) | (gg.End>100000))
...     )
...    ]
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 9939      | 10138     | H3K27me3   | 7         | +            |
| chr1         | 9953      | 10152     | H3K27me3   | 5         | +            |
| chr1         | 10246     | 10445     | H3K27me3   | 4         | +            |
| chr1         | 9916      | 10115     | H3K27me3   | 5         | -            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chr1         | 9978      | 10177     | H3K27me3   | 7         | -            |
| chr1         | 10001     | 10200     | H3K27me3   | 5         | -            |
| chr1         | 10127     | 10326     | H3K27me3   | 1         | -            |
| chr1         | 10241     | 10440     | H3K27me3   | 6         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 9 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Another way to select rows is the subset method, for which you provide a function which is then applied to each DataFrame of the collection, and which must return a boolean Series. Typically, you define a lambda function on-the-fly:

>>> # the following is equivalent to gg[gg.Score.isin([2, 4, 6])
>>> gg.subset(lambda x:x.Score.isin([2,4,6]))
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |     10246 |     10445 | H3K27me3   |         4 | +            |
| chr1         |     10241 |     10440 | H3K27me3   |         6 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 2 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The method subset is suited for complex pandas operations, and it is also useful in method call chains.

Lastly, a fairly specific form of row selection is drop_duplicate_positions, which gets rid of interval with the same coordinates:

>>> d = {"Chromosome": [1, 1, 1, 2, 2],
...      "Start": [1, 1, 2, 1, 8],
...      "End": [4, 4, 9, 4, 12],
...      "Strand": ["+", "+", "+", "+","-"],
...      "ID": ["a", "b", "c", "d", "e"]}
>>> p = pr.from_dict(d)
>>> p
+--------------+-----------+-----------+--------------+------------+
|   Chromosome |     Start |       End | Strand       | ID         |
|   (category) |   (int64) |   (int64) | (category)   | (object)   |
|--------------+-----------+-----------+--------------+------------|
|            1 |         1 |         4 | +            | a          |
|            1 |         1 |         4 | +            | b          |
|            1 |         2 |         9 | +            | c          |
|            2 |         1 |         4 | +            | d          |
|            2 |         8 |        12 | -            | e          |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 5 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> p.drop_duplicate_positions()
+--------------+-----------+-----------+--------------+------------+
|   Chromosome |     Start |       End | Strand       | ID         |
|   (category) |   (int64) |   (int64) | (category)   | (object)   |
|--------------+-----------+-----------+--------------+------------|
|            1 |         1 |         4 | +            | a          |
|            1 |         2 |         9 | +            | c          |
|            2 |         1 |         4 | +            | d          |
|            2 |         8 |        12 | -            | e          |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 4 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Normally, the first instance of duplicated intervals is retained. Through argument keep=False, you can decide to remove them all:

>>> p.drop_duplicate_positions(keep=False)
+--------------+-----------+-----------+--------------+------------+
|   Chromosome |     Start |       End | Strand       | ID         |
|   (category) |   (int64) |   (int64) | (category)   | (object)   |
|--------------+-----------+-----------+--------------+------------|
|            1 |         2 |         9 | +            | c          |
|            2 |         1 |         4 | +            | d          |
|            2 |         8 |        12 | -            | e          |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 3 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Selecting columns

As previously seen, single PyRanges column (which are pandas Series) can be extracted through the dot notation:

>>> gr = pr.data.chipseq()
>>> gr.Chromosome
18      chr1
70      chr1
129     chr1
170     chr1
196     chr1
        ...
3023    chrY
3131    chrY
3816    chrY
3897    chrY
9570    chrY
Name: Chromosome, Length: 10000, dtype: category
Categories (24, object): ['chr1', 'chr10', 'chr11', 'chr12', ..., 'chr8', 'chr9', 'chrX', 'chrY']

The same syntax can be used for the core PyRanges columns (Chromosome, Strand, Start, End) or for metadata columns:

>>> gr.Name
18      U0
70      U0
129     U0
170     U0
196     U0
        ..
3023    U0
3131    U0
3816    U0
3897    U0
9570    U0
Name: Name, Length: 10000, dtype: object

This syntax is analogous to pandas Dataframes. Note that, however, the bracket column selection in pandas does not work in the same way in PyRanges:

>>> df=gr.df
>>> df['End']
0       212609559
1       169887554
2       216711036
3       144227104
4       148177850
          ...
9995      7046834
9996     15224260
9997     13517917
9998      8010976
9999      7405401
Name: End, Length: 10000, dtype: int64
>>> gr['End']
Empty PyRanges

Because the last expression is evaluated as a genomic region, i.e. a form of row selection: it is searching for intervals on a Chromosome named “End”, and finds none. Indeed, this fetches intervals on the chrY:

>>> gr['chrY']
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chrY         | 12930373  | 12930398  | U0         | 0         | +            |
| chrY         | 15548022  | 15548047  | U0         | 0         | +            |
| chrY         | 7194340   | 7194365   | U0         | 0         | +            |
| chrY         | 21559181  | 21559206  | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 23 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

You can provide a list of column names in the bracket notation to select those columns. Pyranges will still return a PyRanges object, therefore retaining the core columns regardless of whether they were selected or not:

>>> gr[ ['Name'] ]
+--------------+-----------+-----------+------------+--------------+
| Chromosome   | Start     | End       | Name       | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (category)   |
|--------------+-----------+-----------+------------+--------------|
| chr1         | 212609534 | 212609559 | U0         | +            |
| chr1         | 169887529 | 169887554 | U0         | +            |
| chr1         | 216711011 | 216711036 | U0         | +            |
| chr1         | 144227079 | 144227104 | U0         | +            |
| ...          | ...       | ...       | ...        | ...          |
| chrY         | 15224235  | 15224260  | U0         | -            |
| chrY         | 13517892  | 13517917  | U0         | -            |
| chrY         | 8010951   | 8010976   | U0         | -            |
| chrY         | 7405376   | 7405401   | U0         | -            |
+--------------+-----------+-----------+------------+--------------+
Stranded PyRanges object has 10,000 rows and 5 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

This is convenient to reduce genome annotation that consists of many columns:

>>> ensembl_path = pr.get_example_path("ensembl.gtf")
>>> ge = pr.read_gtf(ensembl_path)
>>> ge
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | gene_name   | gene_source    | +14   |
| (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)        | (object)       | (object)    | (object)       | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+-------|
| 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | DDX11L1     | havana         | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...         | ...            | ...   |
| 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
| 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
| 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
| 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | AL627309.1  | ensembl_havana | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+-------+
Stranded PyRanges object has 95 rows and 26 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
14 hidden columns: gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, exon_number, exon_id, exon_version, ... (+ 3 more.)
>>> ge[ ['gene_id', 'gene_name'] ]
+--------------+-----------+-----------+--------------+-----------------+-------------+
| Chromosome   | Start     | End       | Strand       | gene_id         | gene_name   |
| (category)   | (int64)   | (int64)   | (category)   | (object)        | (object)    |
|--------------+-----------+-----------+--------------+-----------------+-------------|
| 1            | 11868     | 14409     | +            | ENSG00000223972 | DDX11L1     |
| 1            | 11868     | 14409     | +            | ENSG00000223972 | DDX11L1     |
| 1            | 11868     | 12227     | +            | ENSG00000223972 | DDX11L1     |
| 1            | 12612     | 12721     | +            | ENSG00000223972 | DDX11L1     |
| ...          | ...       | ...       | ...          | ...             | ...         |
| 1            | 120724    | 133723    | -            | ENSG00000238009 | AL627309.1  |
| 1            | 133373    | 133723    | -            | ENSG00000238009 | AL627309.1  |
| 1            | 129054    | 129223    | -            | ENSG00000238009 | AL627309.1  |
| 1            | 120873    | 120932    | -            | ENSG00000238009 | AL627309.1  |
+--------------+-----------+-----------+--------------+-----------------+-------------+
Stranded PyRanges object has 95 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The drop method is an alternative way of column selection wherein we specify what we want to remove, rather than what to keep:

>>> gr.print()
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> gr.drop(['Name']).print()
+--------------+-----------+-----------+-----------+--------------+
| Chromosome   | Start     | End       | Score     | Strand       |
| (category)   | (int64)   | (int64)   | (int64)   | (category)   |
|--------------+-----------+-----------+-----------+--------------|
| chr1         | 212609534 | 212609559 | 0         | +            |
| chr1         | 169887529 | 169887554 | 0         | +            |
| chr1         | 216711011 | 216711036 | 0         | +            |
| chr1         | 144227079 | 144227104 | 0         | +            |
| ...          | ...       | ...       | ...       | ...          |
| chrY         | 15224235  | 15224260  | 0         | -            |
| chrY         | 13517892  | 13517917  | 0         | -            |
| chrY         | 8010951   | 8010976   | 0         | -            |
| chrY         | 7405376   | 7405401   | 0         | -            |
+--------------+-----------+-----------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 5 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Without arguments, drop will get rid of all non-core columns:

>>> gr.drop()
+--------------+-----------+-----------+--------------+
| Chromosome   | Start     | End       | Strand       |
| (category)   | (int64)   | (int64)   | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         | 212609534 | 212609559 | +            |
| chr1         | 169887529 | 169887554 | +            |
| chr1         | 216711011 | 216711036 | +            |
| chr1         | 144227079 | 144227104 | +            |
| ...          | ...       | ...       | ...          |
| chrY         | 15224235  | 15224260  | -            |
| chrY         | 13517892  | 13517917  | -            |
| chrY         | 8010951   | 8010976   | -            |
| chrY         | 7405376   | 7405401   | -            |
+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 4 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

If you want to obtain a DataFrame with certain columns rather than a PyRanges object, get a DataFrame copy through the df property, then perform pandas-style column selection. Obviously, in this case core columns are returned only if explicitly selected:

>>> gr.df [ ['Name', 'Start'] ]
     Name      Start
0      U0  212609534
1      U0  169887529
2      U0  216711011
3      U0  144227079
4      U0  148177825
...   ...        ...
9995   U0    7046809
9996   U0   15224235
9997   U0   13517892
9998   U0    8010951
9999   U0    7405376

[10000 rows x 2 columns]

Obtaining sequences

A common operation is to fetch the sequences corresponding to the intervals represented in the PyRanges object. Function get_sequence takes as input a PyRanges object and the path to a fasta file, and returns a Series containing sequences, in the same order as the intervals. It requires package pyfaidx (install with pip install pyfaidx).

In the tutorial, we saw its usage with a real genome. Let’s make a toy example here:

>>> with open('minigenome.fa', 'w') as fw:
...     fw.write('>chrZ\n')
...     fw.write('AAAGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTT\n')
>>> sg = pr.from_dict({"Chromosome": ["chrZ", "chrZ", "chrZ", "chrZ"],
...              "Start": [0, 5, 10, 10], "End": [3, 8, 20, 20],
...              "name":["a", "a", "b", "c"],
...              "Strand":["+", "+", "+", "-"] })
>>> sg
+--------------+-----------+-----------+------------+--------------+
| Chromosome   |     Start |       End | name       | Strand       |
| (category)   |   (int64) |   (int64) | (object)   | (category)   |
|--------------+-----------+-----------+------------+--------------|
| chrZ         |         0 |         3 | a          | +            |
| chrZ         |         5 |         8 | a          | +            |
| chrZ         |        10 |        20 | b          | +            |
| chrZ         |        10 |        20 | c          | -            |
+--------------+-----------+-----------+------------+--------------+
Stranded PyRanges object has 4 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Note the genome sequence in the code above. Let’s run get_sequences to obtain the portions corresponding to our intervals:

>>> pr.get_sequence(sg, 'minigenome.fa')
0           AAA
1           GCC
2    TTAAAGGGCC
3    GGCCCTTTAA
dtype: object

Note that the last two intervals have identical coordinates but are on opposite strands. Function get_sequence returns the reverse complement for intervals on the negative strand.

Since the returned Series has the same length as the PyRanges object, we can assign it to a new column:

>>> sg.Sequence = pr.get_sequence(sg, 'minigenome.fa')
>>> sg
+--------------+-----------+-----------+------------+--------------+------------+
| Chromosome   |     Start |       End | name       | Strand       | Sequence   |
| (category)   |   (int64) |   (int64) | (object)   | (category)   | (object)   |
|--------------+-----------+-----------+------------+--------------+------------|
| chrZ         |         0 |         3 | a          | +            | AAA        |
| chrZ         |         5 |         8 | a          | +            | GCC        |
| chrZ         |        10 |        20 | b          | +            | TTAAAGGGCC |
| chrZ         |        10 |        20 | c          | -            | GGCCCTTTAA |
+--------------+-----------+-----------+------------+--------------+------------+
Stranded PyRanges object has 4 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

This allows us to filter by sequence, using pandas string methods. For example, let’s get those that start with G:

>>> sg[sg.Sequence.str.startswith('G')]
+--------------+-----------+-----------+------------+--------------+------------+
| Chromosome   |     Start |       End | name       | Strand       | Sequence   |
| (category)   |   (int64) |   (int64) | (object)   | (category)   | (object)   |
|--------------+-----------+-----------+------------+--------------+------------|
| chrZ         |         5 |         8 | a          | +            | GCC        |
| chrZ         |        10 |        20 | c          | -            | GGCCCTTTAA |
+--------------+-----------+-----------+------------+--------------+------------+
Stranded PyRanges object has 2 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Let’s get those which contain a CC and AA dinucleotides separated by 1-3 nucleotides:

>>> sg[sg.Sequence.str.contains(r'CC.{1,3}AA', regex=True)]
+--------------+-----------+-----------+------------+--------------+------------+
| Chromosome   |     Start |       End | name       | Strand       | Sequence   |
| (category)   |   (int64) |   (int64) | (object)   | (category)   | (object)   |
|--------------+-----------+-----------+------------+--------------+------------|
| chrZ         |        10 |        20 | c          | -            | GGCCCTTTAA |
+--------------+-----------+-----------+------------+--------------+------------+
Stranded PyRanges object has 1 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Function get_sequence will treat each interval independently. Often, you want to get the sequence of an mRNA, i.e. concatenating exons. Function get_transcript_sequence serves this purpose, and employs argument group_by to group the exons into mRNAs:

>>> pr.get_transcript_sequence(sg, group_by='name', path='minigenome.fa')
  name    Sequence
0    a      AAAGCC
1    b  TTAAAGGGCC
2    c  GGCCCTTTAA

Note that this returns a pandas DataFrame with a row per exon group: its shape is different from the original PyRanges.

Operating with data

In this section, we give an overview of methods to modify the data in PyRanges. Changing row order Methods sort allows to sort intervals, i.e. altering the order of rows in the PyRanges object. When run without arguments, orders interval by increasing Start. Commonly, genomic annotation files are sorted in this way.

>>> sg = pr.from_dict({"Chromosome": ["chrA", "chrA", "chrB", "chrB", "chrB"],
...              "Start": [55, 20, 65, 35, 75],
...              "End": [88, 30, 75, 45, 85],
...              "name":["a", "a", "b", "c", "c"],
...              "Strand":["+", "+", "+", "-", "-"] })
>>> sg
+--------------+-----------+-----------+------------+--------------+
| Chromosome   |     Start |       End | name       | Strand       |
| (category)   |   (int64) |   (int64) | (object)   | (category)   |
|--------------+-----------+-----------+------------+--------------|
| chrA         |        55 |        88 | a          | +            |
| chrA         |        20 |        30 | a          | +            |
| chrB         |        65 |        75 | b          | +            |
| chrB         |        35 |        45 | c          | -            |
| chrB         |        75 |        85 | c          | -            |
+--------------+-----------+-----------+------------+--------------+
Stranded PyRanges object has 5 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> sg.sort()
+--------------+-----------+-----------+------------+--------------+
| Chromosome   |     Start |       End | name       | Strand       |
| (category)   |   (int64) |   (int64) | (object)   | (category)   |
|--------------+-----------+-----------+------------+--------------|
| chrA         |        20 |        30 | a          | +            |
| chrA         |        55 |        88 | a          | +            |
| chrB         |        65 |        75 | b          | +            |
| chrB         |        35 |        45 | c          | -            |
| chrB         |        75 |        85 | c          | -            |
+--------------+-----------+-----------+------------+--------------+
Stranded PyRanges object has 5 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Remember that sorting is performed separately for each internal table: intervals on different chromosome/strands won’t ever cross each other. To have all intervals sorted, work with a DataFrame object instead.

For intervals on the negative strand, it may be convenient to sort in the opposite order, since for them the leftmost exon is actually the last one in the mRNA. Instead of having to split the PyRanges object for this task, you may run sort with special argument “5”, which will sort intervals in 5’ to 3’ order:

>>> sg.sort('5')
+--------------+-----------+-----------+------------+--------------+
| Chromosome   |     Start |       End | name       | Strand       |
| (category)   |   (int64) |   (int64) | (object)   | (category)   |
|--------------+-----------+-----------+------------+--------------|
| chrA         |        20 |        30 | a          | +            |
| chrA         |        55 |        88 | a          | +            |
| chrB         |        65 |        75 | b          | +            |
| chrB         |        75 |        85 | c          | -            |
| chrB         |        35 |        45 | c          | -            |
+--------------+-----------+-----------+------------+--------------+
Stranded PyRanges object has 5 rows and 5 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Sorting may also take any column name, or a list of colum names, to sort rows by their value:

>>> ag = pr.from_dict({"Chromosome": "chrX",
...              "Start": [55, 65, 20, 35, 75],
...              "End": [88, 75, 30, 45, 85],
...              "Strand":["+", "+", "+", "+", "+"],
...              "col1":[1, 4, 4, 2, 2],
...               })
>>> ag
+--------------+-----------+-----------+--------------+-----------+
| Chromosome   |     Start |       End | Strand       |      col1 |
| (category)   |   (int64) |   (int64) | (category)   |   (int64) |
|--------------+-----------+-----------+--------------+-----------|
| chrX         |        55 |        88 | +            |         1 |
| chrX         |        65 |        75 | +            |         4 |
| chrX         |        20 |        30 | +            |         4 |
| chrX         |        35 |        45 | +            |         2 |
| chrX         |        75 |        85 | +            |         2 |
+--------------+-----------+-----------+--------------+-----------+
Stranded PyRanges object has 5 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> ag.sort('col1')
+--------------+-----------+-----------+--------------+-----------+
| Chromosome   |     Start |       End | Strand       |      col1 |
| (category)   |   (int64) |   (int64) | (category)   |   (int64) |
|--------------+-----------+-----------+--------------+-----------|
| chrX         |        55 |        88 | +            |         1 |
| chrX         |        35 |        45 | +            |         2 |
| chrX         |        75 |        85 | +            |         2 |
| chrX         |        65 |        75 | +            |         4 |
| chrX         |        20 |        30 | +            |         4 |
+--------------+-----------+-----------+--------------+-----------+
Stranded PyRanges object has 5 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> ag.sort(['col1', 'End'])
+--------------+-----------+-----------+--------------+-----------+
| Chromosome   |     Start |       End | Strand       |      col1 |
| (category)   |   (int64) |   (int64) | (category)   |   (int64) |
|--------------+-----------+-----------+--------------+-----------|
| chrX         |        55 |        88 | +            |         1 |
| chrX         |        35 |        45 | +            |         2 |
| chrX         |        75 |        85 | +            |         2 |
| chrX         |        20 |        30 | +            |         4 |
| chrX         |        65 |        75 | +            |         4 |
+--------------+-----------+-----------+--------------+-----------+
Stranded PyRanges object has 5 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.