Loading/Creating PyRanges
A PyRanges object can be built like a pandas DataFrame, but the genomic location columns (Chromosome, Start, End) are mandatory. Refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html for more information on how to create a DataFrame. Alternatively, a PyRanges object can be read from a file in bed, gtf, gff3 or bam format.
PyRanges are created in the following ways:
from a pandas dataframe
from a dictionary with the column names as keys and iterables as values
from a file, using pyranges1 readers
concatenating existing PyRanges objects
From a DataFrame
If you instantiate a PyRanges object from a dataframe, it must contain at least the columns Chromosome, Start and End. Coordinates follow the Python convention (0-based, start included, end excluded). A column called Strand is optional. Any other columns in the dataframe are carried over as metadata.
>>> import pandas as pd, pyranges1 as pr, numpy as np
>>> df=pd.DataFrame(
... {'Chromosome':['chr1', 'chr1', 'chr1', 'chr3'],
... 'Start': [5, 20, 80, 10],
... 'End': [10, 28, 95, 38],
... 'Strand':['+', '+', '-', '+'],
... 'title': ['a', 'b', 'c', 'd']}
... )
>>> df
Chromosome Start End Strand title
0 chr1 5 10 + a
1 chr1 20 28 + b
2 chr1 80 95 - c
3 chr3 10 38 + d
To instantiate a PyRanges object from a dataframe, pass it as an argument to the PyRanges constructor:
>>> p=pr.PyRanges(df)
>>> p
index | Chromosome Start End Strand title
int64 | str int64 int64 str str
------- --- ------------ ------- ------- -------- -------
0 | chr1 5 10 + a
1 | chr1 20 28 + b
2 | chr1 80 95 - c
3 | chr3 10 38 + d
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
From a Dictionary
You can instantiate a PyRanges object using a dictionary:
>>> gr = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1', 'chr3'],
... 'Start': [5, 20, 80, 10],
... 'End': [10, 28, 95, 38],
... 'Strand': ['+', '+', '-', '+'],
... 'title': ['a', 'b', 'c', 'd']})
>>> gr
index | Chromosome Start End Strand title
int64 | str int64 int64 str str
------- --- ------------ ------- ------- -------- -------
0 | chr1 5 10 + a
1 | chr1 20 28 + b
2 | chr1 80 95 - c
3 | chr3 10 38 + d
PyRanges with 4 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
Both the {} and the dict constructors can be used to create a dictionary:
>>> gr2 = pr.PyRanges(dict(Chromosome=['chr10', 'chr10', 'chr1', 'chr3'],
... Start=[55, 250, 80, 100],
... End=[150, 258, 95, 380]))
>>> gr2
index | Chromosome Start End
int64 | str int64 int64
------- --- ------------ ------- -------
0 | chr10 55 150
1 | chr10 250 258
2 | chr1 80 95
3 | chr3 100 380
PyRanges with 4 rows, 3 columns, and 1 index columns.
Contains 3 chromosomes.
As when creating a DataFrame, every value in the dictionary must have the same length, i.e. the number of rows in the PyRanges. Values may be any iterable, including pd.Series or np.array. Alternatively, if a string or scalar is provided, it is broadcast to the length of the other columns:
>>> gr3 = pr.PyRanges(dict(Chromosome=pd.Series(['chr10', 'chr10', 'chr1', 'chr3']),
... Start=np.array([55, 250, 80, 100]),
... End=[150, 258, 95, 380],
... Strand='+'))
>>> gr3
index | Chromosome Start End Strand
int64 | str int64 int64 str
------- --- ------------ ------- ------- --------
0 | chr10 55 150 +
1 | chr10 250 258 +
2 | chr1 80 95 +
3 | chr3 100 380 +
PyRanges with 4 rows, 4 columns, and 1 index columns.
Contains 3 chromosomes and 1 strands.
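The length and broadcasting rules described above are those of the pandas DataFrame constructor. The following is a minimal sketch that uses pandas only, so it shows the DataFrame behaviour the text refers to rather than the internals of pyranges1. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Mixed column types: a Series, a NumPy array, a list and a scalar.
columns = dict(
    Chromosome=pd.Series(["chr10", "chr10", "chr1", "chr3"]),
    Start=np.array([55, 250, 80, 100]),
    End=[150, 258, 95, 380],
    Strand="+",  # scalar: repeated for every row
)
df = pd.DataFrame(columns)
strands = df["Strand"].tolist()

# Columns of unequal length are rejected by pandas with a ValueError.
try:
    pd.DataFrame({"Start": [1, 2, 3], "End": [5, 6]})
    rejected = False
except ValueError:
    rejected = True
```

Here `df` has four rows, and the single `'+'` is repeated in every row of the Strand column.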
Loading from a file
The pyranges1 library can create PyRanges from common file formats, namely gtf/gff, gff3, bed and bam (see
read_bed, read_gtf,
read_gff3, read_bam).
The documentation of the readers is available in the pyranges1 module.
Note that these files may encode coordinates with different conventions (e.g. GTF: 1-based, start and end included).
When a PyRanges object is created from such a file, the coordinates are converted to the Python convention.
>>> ensembl_path = pr.example_data.files['ensembl.gtf'] # example file
>>> gr = pr.read_gtf(ensembl_path)
>>> gr
index | Chromosome Source Feature Start End Score Strand Frame ...
int64 | category category category int64 int64 str category category ...
------- --- ------------ ---------- ---------- ------- ------- ------- ---------- ---------- -----
0 | 1 havana gene 11868 14409 . + . ...
1 | 1 havana transcript 11868 14409 . + . ...
2 | 1 havana exon 11868 12227 . + . ...
3 | 1 havana exon 12612 12721 . + . ...
... | ... ... ... ... ... ... ... ... ...
8 | 1 ensembl transcript 120724 133723 . - . ...
9 | 1 ensembl exon 133373 133723 . - . ...
10 | 1 ensembl exon 129054 129223 . - . ...
11 | 1 ensembl exon 120873 120932 . - . ...
PyRanges with 12 rows, 23 columns, and 1 index columns. (15 columns not shown: "gene_id", "gene_version", "gene_name", ...).
Contains 1 chromosomes and 2 strands.
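The coordinate conversion mentioned above (1-based, end-inclusive to 0-based, end-exclusive) amounts to subtracting one from the start and leaving the end unchanged. The sketch below is plain Python and is not the code used by the pyranges1 readers. The function names are hypothetical, and the 1-based input values are assumed for illustration, chosen so that the result matches the first row of the output above.

```python
def one_based_inclusive_to_python(start, end):
    """Convert a 1-based, end-inclusive interval to 0-based, end-exclusive."""
    return start - 1, end


def python_to_one_based_inclusive(start, end):
    """Inverse conversion: 0-based, end-exclusive to 1-based, end-inclusive."""
    return start + 1, end


# Assumed 1-based coordinates, as they could appear in a GTF line.
gtf_start, gtf_end = 11869, 14409
py_start, py_end = one_based_inclusive_to_python(gtf_start, gtf_end)

# The interval length is the same under both conventions.
length_python = py_end - py_start
length_gtf = gtf_end - gtf_start + 1
```

With the Python convention the length of an interval is simply End minus Start, with no off-by-one correction.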
To read bam files, the optional bamread library must be installed:
pip install bamread
Let’s read a bam file:
>>> bam_path = pr.example_data.files['smaller.bam'] # example file
>>> gr4 = pr.read_bam(bam_path)
>>> gr4
index | Chromosome Start End Strand Flag
int64 | category int64 int64 category uint16
------- --- ------------ -------- -------- ---------- --------
0 | chr1 887771 887796 - 16
1 | chr1 994660 994685 - 16
2 | chr1 1041102 1041127 + 0
3 | chr1 1770383 1770408 - 16
... | ... ... ... ... ...
96 | chr1 18800901 18800926 + 0
97 | chr1 18800901 18800926 + 0
98 | chr1 18855123 18855148 - 16
99 | chr1 19373470 19373495 + 0
PyRanges with 100 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 2 strands.
read_bam takes various arguments, such as sparse.
With sparse=True (default), only the columns ['Chromosome', 'Start', 'End', 'Strand', 'Flag']
are fetched. Setting sparse=False additionally gives you the columns
['QueryStart', 'QueryEnd', 'QuerySequence', 'Name', 'Cigar', 'Quality'], but takes more time and memory:
>>> pr.read_bam(bam_path, sparse=False)
index | Chromosome Start End Strand Flag QueryStart QueryEnd QuerySequence Name ...
int64 | category int64 int64 category uint16 int64 int64 object str ...
------- --- ------------ -------- -------- ---------- -------- ------------ ---------- --------------- ------ -----
0 | chr1 887771 887796 - 16 0 25 None U0 ...
1 | chr1 994660 994685 - 16 0 25 None U0 ...
2 | chr1 1041102 1041127 + 0 0 25 None U0 ...
3 | chr1 1770383 1770408 - 16 0 25 None U0 ...
... | ... ... ... ... ... ... ... ... ... ...
96 | chr1 18800901 18800926 + 0 0 25 None U0 ...
97 | chr1 18800901 18800926 + 0 0 25 None U0 ...
98 | chr1 18855123 18855148 - 16 0 25 None U0 ...
99 | chr1 19373470 19373495 + 0 0 25 None U0 ...
PyRanges with 100 rows, 11 columns, and 1 index columns. (2 columns not shown: "Cigar", "Quality").
Contains 1 chromosomes and 2 strands.
To load a tabular file in any other format, you can use the pandas read_csv function and then pass the resulting dataframe to the
PyRanges constructor. Be aware of the coordinate convention of the file you load, and make sure that the
dataframe contains the required columns (Chromosome, Start, End), named exactly so.
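As a minimal sketch of this route, the example below reads a small tab-separated table with pandas, renames its columns, and converts its coordinates. The table contents, its column names (seqid, begin, stop, name) and its 1-based, end-inclusive convention are assumptions made for illustration. An in-memory buffer stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical table with 1-based, end-inclusive coordinates.
raw = io.StringIO(
    "seqid\tbegin\tstop\tname\n"
    "chr1\t6\t10\ta\n"
    "chr1\t21\t28\tb\n"
)

df = pd.read_csv(raw, sep="\t")

# Rename to the column names that the PyRanges constructor requires.
df = df.rename(columns={"seqid": "Chromosome", "begin": "Start", "stop": "End"})

# Convert to the Python convention: shift Start by one, keep End as is.
df["Start"] = df["Start"] - 1

print(df)
```

The resulting dataframe can then be passed to the constructor as shown earlier, i.e. pr.PyRanges(df).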
Concatenating PyRanges
Analogously to pandas.concat, pyranges1.concat() can be used to concatenate PyRanges objects, i.e.
stack rows of two or more PyRanges to create a new PyRanges object.
>>> gr1 = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1', 'chr3'],
... 'Start': [5, 20, 80, 10],
... 'End': [10, 28, 95, 38],
... 'Strand': ['+', '+', '-', '+'],
... 'title': ['a', 'b', 'c', 'd']})
>>> gr2 = pr.PyRanges({'Chromosome': ['chr1', 'chr1', 'chr1', 'chr3'],
... 'Start': [5, 20, 80, 10],
... 'End': [10, 28, 95, 38],
... 'Strand': ['+', '+', '-', '+'],
... 'title': ['a', 'b', 'c', 'd']})
>>> gr3 = pr.concat([gr1, gr2])
>>> gr3
index | Chromosome Start End Strand title
int64 | str int64 int64 str str
------- --- ------------ ------- ------- -------- -------
0 | chr1 5 10 + a
1 | chr1 20 28 + b
2 | chr1 80 95 - c
3 | chr3 10 38 + d
0 | chr1 5 10 + a
1 | chr1 20 28 + b
2 | chr1 80 95 - c
3 | chr3 10 38 + d
PyRanges with 8 rows, 5 columns, and 1 index columns (with 4 index duplicates).
Contains 2 chromosomes and 2 strands.
Note that this may result in index duplicates, which can be remedied with the pandas reset_index method:
>>> pr.concat([gr1, gr2]).reset_index(drop=True)
index | Chromosome Start End Strand title
int64 | str int64 int64 str str
------- --- ------------ ------- ------- -------- -------
0 | chr1 5 10 + a
1 | chr1 20 28 + b
2 | chr1 80 95 - c
3 | chr3 10 38 + d
4 | chr1 5 10 + a
5 | chr1 20 28 + b
6 | chr1 80 95 - c
7 | chr3 10 38 + d
PyRanges with 8 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
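The same index behaviour can be seen with plain pandas, to which the text compares pyranges1.concat(). The sketch below uses only pandas.concat and DataFrame.reset_index, with made-up data. It is meant as an illustration of why the duplicates appear, not as a description of how pyranges1.concat() is implemented.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Chromosome": ["chr1", "chr3"], "Start": [5, 10], "End": [10, 38]}
)
df2 = df1.copy()

# Each input keeps its own row labels (0, 1), so labels repeat after stacking.
stacked = pd.concat([df1, df2])
n_duplicates = int(stacked.index.duplicated().sum())

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = stacked.reset_index(drop=True)
```

After renumbering, every row has a unique label again, which matters for any later operation that selects rows by index.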
Data for testing
For testing purposes, pyranges1 provides some data in pr.example_data. See an overview with:
>>> pr.example_data
Available example data:
-----------------------
example_data.chipseq : Example ChIP-seq data.
example_data.chipseq_background : Example ChIP-seq data.
example_data.chromsizes : Example chromsizes data (hg19).
example_data.ensembl_gtf : Example gtf file from Ensembl.
example_data.interpro_hits : Example of InterPro protein hits.
example_data.rfam_hits : Example of RNA motifs (Rfam) as 1-based dataframe.
example_data.f1 : Example bed file.
example_data.f2 : Example bed file.
example_data.aorta : Example ChIP-seq data.
example_data.aorta2 : Example ChIP-seq data.
example_data.ncbi_gff : Example NCBI GFF data.
example_data.ncbi_fasta : Example NCBI fasta.
example_data.files : Return a dict of basenames to file paths of available data.
You can load the data with this syntax:
>>> cs = pr.example_data.chipseq
>>> cs
index | Chromosome Start End Name Score Strand
int64 | category int64 int64 str int64 category
------- --- ------------ --------- --------- ------ ------- ----------
0 | chr8 28510032 28510057 U0 0 -
1 | chr7 107153363 107153388 U0 0 -
2 | chr5 135821802 135821827 U0 0 -
3 | chr14 19418999 19419024 U0 0 -
... | ... ... ... ... ... ...
16 | chr9 120803448 120803473 U0 0 +
17 | chr6 89296757 89296782 U0 0 -
18 | chr1 194245558 194245583 U0 0 +
19 | chr8 57916061 57916086 U0 0 +
PyRanges with 20 rows, 6 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.
Alternatively, you can create random intervals using pyranges1.random().
By default, the data refers to the human genome (hg19):
>>> pr.random(n=5, length=50, seed=123)
index | Chromosome Start End Strand
int64 | str int64 int64 str
------- --- ------------ --------- --------- --------
0 | chr12 108700348 108700398 +
1 | chr1 230144267 230144317 -
2 | chr3 54767920 54767970 +
3 | chr3 162329749 162329799 +
4 | chr3 176218669 176218719 +
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.