Tutorial
PyRanges objects represent sequence intervals (“ranges”) such as genomic regions. Examples include gene annotations, mapped reads, protein domains, and results from Blast or analogous software. PyRanges provides convenient methods to load data in GFF, GTF, BED, BAM format. In this tutorial, we refer to PyRanges objects for their typical use case, i.e. representing genomic intervals, but the same concepts and methods may be applied to RNA or protein sequences.
Every interval in a PyRanges object is defined, at minimum, by its chromosome and start and end coordinates. Optionally, the strand (+ or -) can also be present; if so, the PyRanges object is “Stranded”. The ranges can additionally have an arbitrary number of meta-data fields, i.e. columns associated with them.
Note that PyRanges follows the standard python convention:
coordinates are 0-based
in each interval, the start is included and the end is excluded.
Some genomic coordinate formats (GFF, GTF) use a 1-based, start-and-end-included format. PyRanges takes care of converting between these conventions when loading and writing files in these formats. The data in PyRanges objects are stored as Pandas dataframes. This means that the vast Python ecosystem for high-performance scientific computing is available to manipulate the data in PyRanges objects. While not strictly necessary, having familiarity with pandas greatly facilitates the use of PyRanges. In this tutorial and in how-to pages, we will sometimes mix PyRanges and pandas functionalities.
Getting started
For this tutorial, we will use real data consisting of the genome sequence (fasta format) and annotation (GFF3 format) of a worm with a very small but complex genome (Dimorphilus gyrociliatus, assembly ID GCA_904063045.1). These data were modified for this tutorial only to shorten IDs.
Download and unpack tutorial data with:
curl -O https://mariottigenomicslab.bio.ub.edu/pyranges_data/pyranges_tutorial_data.tar.gz
tar zxf pyranges_tutorial_data.tar.gz
We recommend using ipython or Jupyter for this tutorial. Besides pyranges and some of its dependencies, we will use the optional module pyfaidx. Make sure to install it with:
pip install pyfaidx
Or with:
conda install -c bioconda pyfaidx
Loading and accessing pyranges objects
Let’s import libraries and load the annotation into a PyRanges object called ann
:
>>> import pyranges as pr, pandas as pd
>>> ann = pr.read_gff3('Dgyro_annotation.gff')
>>> ann
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+--------------------+---------------+------------+-------------+-------+
| Chromosome | Source | Feature | Start | End | Score | Strand | Frame | ID | Dbxref | gbkey | mol_type | +13 |
| (category) | (object) | (category) | (int64) | (int64) | (object) | (category) | (object) | (object) | (object) | (object) | (object) | ... |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+--------------------+---------------+------------+-------------+-------|
| 0001.1 | EMBL | region | 0 | 3609319 | . | + | . | 0001.1:1..3609319 | taxon:2664684 | Src | genomic DNA | ... |
| 0001.1 | EMBL | gene | 7326 | 7606 | . | + | . | gene-DGYR_LOCUS1 | nan | Gene | nan | ... |
| 0001.1 | EMBL | mRNA | 7326 | 7606 | . | + | . | rna-DGYR_LOCUS1 | nan | mRNA | nan | ... |
| 0001.1 | EMBL | exon | 7326 | 7606 | . | + | . | exon-DGYR_LOCUS1-1 | nan | mRNA | nan | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0346.1 | EMBL | region | 0 | 1411 | . | + | . | 0346.1:1..1411 | taxon:2664684 | Src | genomic DNA | ... |
| 0347.1 | EMBL | region | 0 | 1094 | . | + | . | 0347.1:1..1094 | taxon:2664684 | Src | genomic DNA | ... |
| 0348.1 | EMBL | region | 0 | 667 | . | + | . | 0348.1:1..667 | taxon:2664684 | Src | genomic DNA | ... |
| 0349.1 | EMBL | region | 0 | 641 | . | + | . | 0349.1:1..641 | taxon:2664684 | Src | genomic DNA | ... |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+--------------------+---------------+------------+-------------+-------+
Stranded PyRanges object has 241,137 rows and 25 columns from 349 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
13 hidden columns: note, Name, gene_biotype, locus_tag, Parent, Note, standard_name, product, protein_id, pseudo, partial, start_range, end_range
The ann
object has lots of columns, most of which are not displayed because of space. Here’s the full list:
>>> ann.columns
Index(['Chromosome', 'Source', 'Feature', 'Start', 'End', 'Score', 'Strand',
'Frame', 'ID', 'Dbxref', 'gbkey', 'mol_type', 'note', 'Name',
'gene_biotype', 'locus_tag', 'Parent', 'Note', 'standard_name',
'product', 'protein_id', 'pseudo', 'partial', 'start_range',
'end_range'],
dtype='object')
Let’s select only certain columns:
>>> ann = ann[ ['Feature', 'Parent', 'ID'] ]
>>> ann
+--------------+--------------+-----------+-----------+--------------+------------------+--------------------+
| Chromosome | Feature | Start | End | Strand | Parent | ID |
| (category) | (category) | (int64) | (int64) | (category) | (object) | (object) |
|--------------+--------------+-----------+-----------+--------------+------------------+--------------------|
| 0001.1 | region | 0 | 3609319 | + | nan | 0001.1:1..3609319 |
| 0001.1 | gene | 7326 | 7606 | + | nan | gene-DGYR_LOCUS1 |
| 0001.1 | mRNA | 7326 | 7606 | + | gene-DGYR_LOCUS1 | rna-DGYR_LOCUS1 |
| 0001.1 | exon | 7326 | 7606 | + | rna-DGYR_LOCUS1 | exon-DGYR_LOCUS1-1 |
| ... | ... | ... | ... | ... | ... | ... |
| 0346.1 | region | 0 | 1411 | + | nan | 0346.1:1..1411 |
| 0347.1 | region | 0 | 1094 | + | nan | 0347.1:1..1094 |
| 0348.1 | region | 0 | 667 | + | nan | 0348.1:1..667 |
| 0349.1 | region | 0 | 641 | + | nan | 0349.1:1..641 |
+--------------+--------------+-----------+-----------+--------------+------------------+--------------------+
Stranded PyRanges object has 241,137 rows and 7 columns from 349 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
As seen above, column selection syntax is analogous to pandas. However, a difference is that PyRanges retained the essential columns Chromosome, Start, End, Strand even though we did not select them.
The Chromosome column can take any value among the sequence names in the genome assembly. Only for the best quality assemblies it corresponds to actual chromosomes, and in other cases it is contigs or scaffolds; for simplicity, here we refer to it as chromosomes. In a fasta file, the sequence name is the first word of a header line (i.e. those starting with “>”). We can have a peek in the assembly in bash:
grep ">" Dgyro_genome.fa | head
>0001.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold001, whole genome shotgun sequence
>0002.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold002, whole genome shotgun sequence
>0003.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold003, whole genome shotgun sequence
>0004.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold004, whole genome shotgun sequence
>0005.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold005, whole genome shotgun sequence
>0006.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold006, whole genome shotgun sequence
>0007.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold007, whole genome shotgun sequence
>0008.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold008, whole genome shotgun sequence
>0009.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold009, whole genome shotgun sequence
>0010.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold010, whole genome shotgun sequence
Genomic annotations often contain information for diverse entities, such as genes, mRNAs, exons, CDS, etc. In GFF files, the entity type is encoded in the Feature column. In pyranges, you use the dot notation to fetch an individual column, which is technically a pandas Series:
>>> ann.Feature
0 region
1 gene
2 mRNA
3 exon
4 CDS
...
0 region
0 region
0 region
0 region
0 region
Name: Feature, Length: 241137, dtype: category
Categories (6, object): ['CDS', 'exon', 'gene', 'mRNA', 'pseudogene', 'region']
Let’s focus on a subset of the annotation: CDS intervals, corresponding to coding sequences.
We filter rows and create a new PyRanges object called cds
:
>>> selector = (ann.Feature == 'CDS')
>>> cds = ann [selector]
We used another syntax familiar to pandas users. The object selector
is a Series of boolean values, so it can be used to index PyRanges.
Let’s further reduce the width of the cds object. We showcase an alternative method for column selection: the method drop lets us choose which columns to discard.
>>> cds = cds.drop( ['Feature', 'Parent'] )
>>> cds
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1 | 7327 | 7606 | + | cds-CAD5110615.1 |
| 0001.1 | 46377 | 47603 | + | cds-CAD5110625.1 |
| 0001.1 | 48406 | 49448 | + | cds-CAD5110625.1 |
| 0001.1 | 46377 | 47603 | + | cds-CAD5110626.1 |
| ... | ... | ... | ... | ... |
| 0117.1 | 26816 | 27881 | - | cds-CAD5126988.1 |
| 0117.1 | 36183 | 38697 | - | cds-CAD5126989.1 |
| 0117.1 | 39309 | 39450 | - | cds-CAD5126990.1 |
| 0117.1 | 38911 | 39256 | - | cds-CAD5126990.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 100,040 rows and 5 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
The interface shown so far is analogous to pandas. Additionally, pyranges offers a non-pandas syntax for selecting intervals in a genomic region of interest (i.e. region retrieval). The code below will show only intervals completely included in the specified position range in the requested chromosome:
>>> cds['0002.1', 145000:150000]
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 2.1 | 146123 | 146246 | + | cds-CAD5111630.1 |
| 2.1 | 146306 | 147077 | + | cds-CAD5111630.1 |
| 2.1 | 147830 | 147976 | + | cds-CAD5111631.1 |
| 2.1 | 148191 | 148297 | + | cds-CAD5111631.1 |
| 2.1 | 148360 | 148489 | + | cds-CAD5111631.1 |
| 2.1 | 145002 | 145116 | - | cds-CAD5111629.1 |
| 2.1 | 145176 | 145262 | - | cds-CAD5111629.1 |
| 2.1 | 145320 | 145435 | - | cds-CAD5111629.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
The syntax for region retrieval may consists of:
chromosome
chromosome, position slice (as the example above)
chromosome, strand, position slice
So, for example, this is also valid:
>>> cds['0002.1', "-", 145000:150000]
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 2.1 | 145002 | 145116 | - | cds-CAD5111629.1 |
| 2.1 | 145176 | 145262 | - | cds-CAD5111629.1 |
| 2.1 | 145320 | 145435 | - | cds-CAD5111629.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
It is important to differentiate between Stranded and Unstranded PyRanges objects. When a Strand column is present and all its values are “+” or “-”, the object is Stranded. When there are invalid values (e.g. “.”) or the Strand column is absent, it is Unstranded. You can check whether an interval is Stranded with:
>>> cds.stranded
True
Certain pyranges methods require a Stranded input.
While the annotation used in this tutorial is naturally Stranded, others may not be.
If necessary, you may use method make_stranded
to transform all invalid Strand values to “+” or remove them.
Working with groups of exons
Multi-exonic genes are represented with multiple rows in PyRanges. In this tutorial, the ID
column links the
intervals belonging to the same CDS: these rows have the same ID value.
While this concept applies to all annotations, files from different sources may use different column names for this purpose (e.g. transcript_id).
Note that here we focus on CDS regions. These may encompass multiple exons, but they do not span the whole mRNA: the 5’UTRs and 3’UTRs are not included.
Next, we will examine the first and last codon of annotated CDSs. We will obtain their genomic coordinate, then fetch their sequence.
Method spliced_subsequence
allows to obtain a subregion of groups of intervals. The code below derives the first codon of each CDS group (grouping is defined by their ID):
>>> first=cds.spliced_subsequence(start=0, end=3, by='ID')
>>> first
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1 | 7327 | 7330 | + | cds-CAD5110615.1 |
| 0001.1 | 46377 | 46380 | + | cds-CAD5110625.1 |
| 0001.1 | 46377 | 46380 | + | cds-CAD5110626.1 |
| 0001.1 | 60099 | 60102 | + | cds-CAD5110627.1 |
| ... | ... | ... | ... | ... |
| 0117.1 | 38694 | 38697 | - | cds-CAD5126989.1 |
| 0117.1 | 27878 | 27881 | - | cds-CAD5126988.1 |
| 0117.1 | 21258 | 21261 | - | cds-CAD5126985.1 |
| 0117.1 | 14274 | 14277 | - | cds-CAD5126984.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 16,391 rows and 5 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
Let’s fetch the sequence for each of these intervals from our genome fasta file.
The function get_sequence
returns one sequence per interval, which we assign to a new column of our pyranges object:
>>> first.Sequence = pr.get_sequence(first, 'Dgyro_genome.fa')
>>> first
+--------------+-----------+-----------+--------------+------------------+------------+
| Chromosome | Start | End | Strand | ID | Sequence |
| (category) | (int64) | (int64) | (category) | (object) | (object) |
|--------------+-----------+-----------+--------------+------------------+------------|
| 0001.1 | 7327 | 7330 | + | cds-CAD5110615.1 | ATG |
| 0001.1 | 46377 | 46380 | + | cds-CAD5110625.1 | ATG |
| 0001.1 | 46377 | 46380 | + | cds-CAD5110626.1 | ATG |
| 0001.1 | 60099 | 60102 | + | cds-CAD5110627.1 | ATG |
| ... | ... | ... | ... | ... | ... |
| 0117.1 | 38694 | 38697 | - | cds-CAD5126989.1 | ATT |
| 0117.1 | 27878 | 27881 | - | cds-CAD5126988.1 | ATG |
| 0117.1 | 21258 | 21261 | - | cds-CAD5126985.1 | ATG |
| 0117.1 | 14274 | 14277 | - | cds-CAD5126984.1 | ATG |
+--------------+-----------+-----------+--------------+------------------+------------+
Stranded PyRanges object has 16,391 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
The Sequence
column is a pandas Series containing strings. We see that the starting codon is ATG in most cases, as expected.
When we check the length of the sequences, we notice that some are not 3-letter long:
>>> (first.Sequence.str.len() == 3 ).all()
False
Let’s look at those sequences, using a row selector as before:
>>> first [ first.Sequence.str.len() != 3 ]
+--------------+-----------+-----------+--------------+------------------+------------+
| Chromosome | Start | End | Strand | ID | Sequence |
| (category) | (int64) | (int64) | (category) | (object) | (object) |
|--------------+-----------+-----------+--------------+------------------+------------|
| 0001.1 | 667699 | 667700 | - | cds-CAD5110773.1 | a |
| 0001.1 | 667641 | 667643 | - | cds-CAD5110773.1 | TG |
| 0002.1 | 2632107 | 2632109 | - | cds-CAD5112293.1 | AT |
| 0002.1 | 2631440 | 2631441 | - | cds-CAD5112293.1 | G |
| ... | ... | ... | ... | ... | ... |
| 0024.1 | 1091339 | 1091341 | - | cds-CAD5125104.1 | AT |
| 0024.1 | 1091163 | 1091164 | - | cds-CAD5125104.1 | G |
| 0025.1 | 39753 | 39755 | - | cds-CAD5125115.1 | at |
| 0025.1 | 39692 | 39693 | - | cds-CAD5125115.1 | g |
+--------------+-----------+-----------+--------------+------------------+------------+
Stranded PyRanges object has 26 rows and 6 columns from 11 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
In some cases the starting codon is split between two exons. This is uncommon, but we are looking at all protein coding genes of a species, so it is expected at least in a few cases. How do we get the full codon sequence?
Instead of get_sequence
, let’s use get_transcript_sequence
, which returns the concatenated sequence of a group of intervals,
i.e. joining exons together. The sequence is given 5’ to 3’.
>>> seq_first = pr.get_transcript_sequence(
... first,
... group_by='ID',
... path='Dgyro_genome.fa'
... )
>>> seq_first
ID Sequence
0 cds-CAD5110614.1 ATG
1 cds-CAD5110615.1 ATG
2 cds-CAD5110616.1 atg
3 cds-CAD5110617.1 atg
4 cds-CAD5110618.1 ATG
... ... ...
16373 cds-DGYR_LOCUS9540 atg
16374 cds-DGYR_LOCUS9596 ATG
16375 cds-DGYR_LOCUS9732 ATG
16376 cds-DGYR_LOCUS980 ATG
16377 cds-DGYR_LOCUS9980 ATG
[16378 rows x 2 columns]
seq_first
is not a PyRanges object, but a pandas DataFrame. It has a column for the group (ID) and one for the sequence.
Here we confirm the sequence length is always 3:
>>> (seq_first.Sequence.str.len()==3).all()
True
Finally, let’s quantify how many start codons are ATG, using a bit of pandas magic.
First, we make sure the whole sequence is in uppercase characters.
Then, we make a boolean Series is_atg
which has True corresponding to the ATG codons,
then we sum its values to count the instances of True, creating variable n_atg
.
We also store the IDs of the CDSs with ATG codons in the variable is_atg_ids
. Finally, we print a summary:
>>> seq_first.Sequence = seq_first.Sequence.str.upper()
>>> is_atg = (seq_first.Sequence == 'ATG')
>>> is_atg_ids = seq_first[is_atg].ID
>>> n_atg = is_atg.sum()
>>> print(f'There are {n_atg} ATG start codons out of '
... f'{len(seq_first)} CDSs => {n_atg/len(seq_first):.2%}')
There are 16339 ATG start codons out of 16378 CDSs => 99.76%
Now, we want to perform an analogous analysis with stop codons.
First, we get the a pyranges object of the last codon of each CDS.
Conveniently, the method spliced_subsequence
accepts negative arguments to count from the 3’,
so we can obtain the last three nucleotides of CDSs with:
>>> last = cds.spliced_subsequence(start=-3, by='ID')
By not providing an end
argument, we requested intervals that reach the very end of each CDS group.
Let’s get their sequence as before, then use pandas function value_counts
to count them:
>>> seq_last = pr.get_transcript_sequence(last, 'ID',
... 'Dgyro_genome.fa')
>>> seq_last.Sequence = seq_last.Sequence.str.upper()
>>> seq_last.Sequence.value_counts()
TAA 8986
TGA 3859
TAG 3488
AAA 4
CTT 3
AAG 3
TTA 3
AGG 2
GAG 2
ATA 2
GTT 2
CGA 2
ACA 2
TAT 2
CCC 1
AGT 1
TCA 1
TTC 1
TTT 1
AGC 1
CAA 1
TTG 1
AAT 1
GTG 1
CCA 1
AAC 1
TCT 1
GGC 1
GAA 1
CAG 1
ATG 1
CTG 1
Name: Sequence, dtype: int64
The canonical stop codons account for the great majority of cases, but there are some other values. This may warrant a further look into these CDSs. In this tutorial, we’ll simply exclude them from our next steps.
Let’s gather the IDs of CDSs with a canonical stop, to be used further on:
>>> is_stop = seq_last.Sequence.isin( {'TAG', 'TAA', 'TGA'} )
>>> is_stop_ids = seq_last[is_stop].ID
Writing coordinates and sequences to the disk
We want to get a “clean” annotation consisting only of canonical CDSs, with an ATG starting codon and a TAA/TAG/TGA stop codon. First, we put together the IDs of CDSs with these characteristics:
>>> clean_ids = set(is_atg_ids).intersection(set(is_stop_ids))
Then we subset the ann pyranges object:
>>> clean_ann = ann[ann.ID.isin(clean_ids)]
>>> clean_ann
+--------------+--------------+-----------+-----------+--------------+---------------------+------------------+
| Chromosome | Feature | Start | End | Strand | Parent | ID |
| (category) | (category) | (int64) | (int64) | (category) | (object) | (object) |
|--------------+--------------+-----------+-----------+--------------+---------------------+------------------|
| 0001.1 | CDS | 7327 | 7606 | + | rna-DGYR_LOCUS1 | cds-CAD5110615.1 |
| 0001.1 | CDS | 46377 | 47603 | + | rna-DGYR_LOCUS4 | cds-CAD5110625.1 |
| 0001.1 | CDS | 48406 | 49448 | + | rna-DGYR_LOCUS4 | cds-CAD5110625.1 |
| 0001.1 | CDS | 46377 | 47603 | + | rna-DGYR_LOCUS4-2 | cds-CAD5110626.1 |
| ... | ... | ... | ... | ... | ... | ... |
| 0117.1 | CDS | 20172 | 21261 | - | rna-DGYR_LOCUS14199 | cds-CAD5126985.1 |
| 0117.1 | CDS | 26816 | 27881 | - | rna-DGYR_LOCUS14201 | cds-CAD5126988.1 |
| 0117.1 | CDS | 39309 | 39450 | - | rna-DGYR_LOCUS14203 | cds-CAD5126990.1 |
| 0117.1 | CDS | 38911 | 39256 | - | rna-DGYR_LOCUS14203 | cds-CAD5126990.1 |
+--------------+--------------+-----------+-----------+--------------+---------------------+------------------+
Stranded PyRanges object has 99,155 rows and 7 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
We can now write this pyranges object to a file, for example in GTF format:
>>> clean_ann.to_gtf('Dgyro_annotation.canonical_CDS.gtf')
Let’s get the sequence for the canonical CDSs and write it to a tabular file.
>>> clean_ann_seq = pr.get_transcript_sequence(clean_ann, 'ID',
... 'Dgyro_genome.fa')
>>> clean_ann_seq.to_csv('Dgyro_canonical_CDS.seq.tsv',
... sep='\t', index=False)
Note that clean_ann_seq
is a pandas DataFrame. To write sequences in fasta format we use:
>>> with open('Dgyro_canonical_CDS.fa', 'w') as fw:
... for xin, xid, xseq in clean_ann_seq.itertuples():
... fw.write(f'>{xid}\n{xseq}\n')
Extending genomic intervals
Now we want to obtain (a practical approximation of) promoter sequences, here defined as the 300bp region before the start codon.
Before we begin, let’s peek into our object cds
:
>>> cds.head()
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 1.1 | 7327 | 7606 | + | cds-CAD5110615.1 |
| 1.1 | 46377 | 47603 | + | cds-CAD5110625.1 |
| 1.1 | 48406 | 49448 | + | cds-CAD5110625.1 |
| 1.1 | 46377 | 47603 | + | cds-CAD5110626.1 |
| 1.1 | 48406 | 48736 | + | cds-CAD5110626.1 |
| 1.1 | 48839 | 48912 | + | cds-CAD5110626.1 |
| 1.1 | 60099 | 60409 | + | cds-CAD5110627.1 |
| 1.1 | 60476 | 60515 | + | cds-CAD5110627.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
First, we use the method extend
to obtain intervals which include the CDS and the promoter defined as above:
>>> g = cds.extend({'5':300}, group_by='ID')
>>> g.head()
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 1.1 | 7027 | 7606 | + | cds-CAD5110615.1 |
| 1.1 | 46077 | 47603 | + | cds-CAD5110625.1 |
| 1.1 | 48406 | 49448 | + | cds-CAD5110625.1 |
| 1.1 | 46077 | 47603 | + | cds-CAD5110626.1 |
| 1.1 | 48406 | 48736 | + | cds-CAD5110626.1 |
| 1.1 | 48839 | 48912 | + | cds-CAD5110626.1 |
| 1.1 | 59799 | 60409 | + | cds-CAD5110627.1 |
| 1.1 | 60476 | 60515 | + | cds-CAD5110627.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
The first argument ensures that the 300bp extension is applied only at the 5’ (left side for + strand intervals, right side for - strand intervals).
Through the group_by
argument, we request one extension per CDS, instead of extending every interval.
In the object we obtained, the promoter corresponds to the first 300 bp of every interval group.
We can use method spliced_subsequence
again to get it:
>>> prom = g.spliced_subsequence(0, 300, 'ID')
>>> prom.head()
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 1.1 | 7027 | 7327 | + | cds-CAD5110615.1 |
| 1.1 | 46077 | 46377 | + | cds-CAD5110625.1 |
| 1.1 | 46077 | 46377 | + | cds-CAD5110626.1 |
| 1.1 | 59799 | 60099 | + | cds-CAD5110627.1 |
| 1.1 | 59799 | 60099 | + | cds-CAD5110628.1 |
| 1.1 | 72099 | 72399 | + | cds-CAD5110629.1 |
| 1.1 | 75736 | 76036 | + | cds-CAD5110630.1 |
| 1.1 | 79997 | 80297 | + | cds-CAD5110631.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
Because we extended intervals, some may have gone out-of-bounds on the left or on the right side: they may have a Start smaller than 0, or an End greater than the length of its chromosome, respectively. Indeed, we see there are cases of the first type:
>>> prom[prom.Start<0]
+--------------+-----------+-----------+--------------+---------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+---------------------|
| 12.1 | -129 | 171 | + | cds-DGYR_LOCUS8189 |
| 18.1 | -12 | 288 | + | cds-DGYR_LOCUS10365 |
| 48.1 | -118 | 182 | + | cds-CAD5126431.1 |
| 78.1 | -299 | 1 | + | cds-CAD5126743.1 |
+--------------+-----------+-----------+--------------+---------------------+
Stranded PyRanges object has 4 rows and 5 columns from 4 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
The function genome_bounds
in submodule genomicfeatures
is designed to correct this.
We may use it to remove out-of-bounds intervals, or to retain only their in-bound portions. We go for the second option, with clip=True
:
>>> import pyfaidx
>>> pyf=pyfaidx.Fasta('Dgyro_genome.fa')
>>> cor_prom = pr.genomicfeatures.genome_bounds(prom,
... chromsizes=pyf,
... clip=True)
To detect cases of out-of-bounds on the right side, function genome_bounds
needs to know chromosome sizes.
Various input types are accepted for the chromsizes
argument; we used a pyfaidx.Fasta
object, which derives it from a fasta file.
The intervals above (and also the right-side out-of-bounds, though we don’t inspect them) have been corrected:
>>> outofbounds_left=prom[prom.Start<0].ID
>>> cor_prom[cor_prom.ID.isin(outofbounds_left)]
+--------------+-----------+-----------+--------------+---------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+---------------------|
| 12.1 | 0 | 171 | + | cds-DGYR_LOCUS8189 |
| 18.1 | 0 | 288 | + | cds-DGYR_LOCUS10365 |
| 48.1 | 0 | 182 | + | cds-CAD5126431.1 |
| 78.1 | 0 | 1 | + | cds-CAD5126743.1 |
+--------------+-----------+-----------+--------------+---------------------+
Stranded PyRanges object has 4 rows and 5 columns from 4 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
Detecting overlaps among intervals
Let’s see if any of the promoter regions overlap other CDSs. Pyranges offers many efficient methods to detect overlaps, such as overlap
:
>>> cor_prom.overlap(cds, strandedness=False)
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1 | 320232 | 320532 | + | cds-CAD5110677.1 |
| 0001.1 | 371611 | 371911 | + | cds-CAD5110692.1 |
| 0001.1 | 434425 | 434725 | + | cds-CAD5110703.1 |
| 0001.1 | 445177 | 445477 | + | cds-CAD5110709.1 |
| ... | ... | ... | ... | ... |
| 0111.1 | 42019 | 42319 | - | cds-CAD5126964.1 |
| 0114.1 | 4745 | 5045 | - | cds-CAD5126973.1 |
| 0117.1 | 12844 | 13144 | + | cds-CAD5126983.1 |
| 0117.1 | 38697 | 38997 | - | cds-CAD5126989.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 2,374 rows and 5 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
By default, this method reports intervals in the self pyranges object (i.e., cor_prom
) that have at least 1bp of overlap with
the other pyranges (i.e., cds). By invoking strandedness=False
, we included overlaps even between intervals on opposite strands.
There are many promoters overlapping CDSs. Let’s get the overlapping regions only, using function intersect
:
>>> prom_in_cds = cor_prom.intersect(cds, strandedness=False)
>>> prom_in_cds
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1 | 320529 | 320532 | + | cds-CAD5110677.1 |
| 0001.1 | 371611 | 371744 | + | cds-CAD5110692.1 |
| 0001.1 | 371611 | 371744 | + | cds-CAD5110692.1 |
| 0001.1 | 371870 | 371911 | + | cds-CAD5110692.1 |
| ... | ... | ... | ... | ... |
| 0114.1 | 4910 | 5045 | - | cds-CAD5126973.1 |
| 0117.1 | 13016 | 13019 | + | cds-CAD5126983.1 |
| 0117.1 | 13080 | 13144 | + | cds-CAD5126983.1 |
| 0117.1 | 38911 | 38997 | - | cds-CAD5126989.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 4,185 rows and 5 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
intersect
returned more rows than overlap
. This is because an interval in cor_prom
may overlap multiple intervals in cds,
potentially with different intersection regions (compare the 2nd, 3rd and 4th rows, which are all subregions of the same promoter).
Therefore, intersect
returns one row for each pair of overlapping intervals, while overlap always returns a subset of rows from the self pyranges object, unaltered.
We want to remove redundancy in the object above. We use the method merge
to fuse intervals that have some overlap (and the same ID value):
>>> prom_in_cds = prom_in_cds.merge(by='ID')
>>> prom_in_cds
+--------------+-----------+-----------+--------------+------------------+
| Chromosome | Start | End | Strand | ID |
| (category) | (int64) | (int64) | (category) | (object) |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1 | 320529 | 320532 | + | cds-CAD5110677.1 |
| 0001.1 | 371611 | 371744 | + | cds-CAD5110692.1 |
| 0001.1 | 371870 | 371911 | + | cds-CAD5110692.1 |
| 0001.1 | 434527 | 434725 | + | cds-CAD5110703.1 |
| ... | ... | ... | ... | ... |
| 0114.1 | 4910 | 5045 | - | cds-CAD5126973.1 |
| 0117.1 | 13016 | 13019 | + | cds-CAD5126983.1 |
| 0117.1 | 13080 | 13144 | + | cds-CAD5126983.1 |
| 0117.1 | 38911 | 38997 | - | cds-CAD5126989.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 3,040 rows and 5 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
Note that with overlap or intersect, we do not keep track of the coordinates of overlapping intervals in the second object (cds
),
as we only obtain those in the first object (cor_prom
). For that task, check methods cluster
and join
(not shown here).
We now want to calculate how long the overlapping regions are. We create a new column named Length
by subtracting Start
from End
.
This operation is performed element-wise (the 1st value of End minus the 1st value of Start, the 2nd End minus the 2nd Start, etc), a common paradigm of pandas Series.
>>> prom_in_cds.Length = prom_in_cds.End - prom_in_cds.Start
>>> prom_in_cds
+--------------+-----------+-----------+--------------+------------------+-----------+
| Chromosome | Start | End | Strand | ID | Length |
| (category) | (int64) | (int64) | (category) | (object) | (int64) |
|--------------+-----------+-----------+--------------+------------------+-----------|
| 0001.1 | 320529 | 320532 | + | cds-CAD5110677.1 | 3 |
| 0001.1 | 371611 | 371744 | + | cds-CAD5110692.1 | 133 |
| 0001.1 | 371870 | 371911 | + | cds-CAD5110692.1 | 41 |
| 0001.1 | 434527 | 434725 | + | cds-CAD5110703.1 | 198 |
| ... | ... | ... | ... | ... | ... |
| 0114.1 | 4910 | 5045 | - | cds-CAD5126973.1 | 135 |
| 0117.1 | 13016 | 13019 | + | cds-CAD5126983.1 | 3 |
| 0117.1 | 13080 | 13144 | + | cds-CAD5126983.1 | 64 |
| 0117.1 | 38911 | 38997 | - | cds-CAD5126989.1 | 86 |
+--------------+-----------+-----------+--------------+------------------+-----------+
Stranded PyRanges object has 3,040 rows and 6 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
Pandas vs Pyranges
It is convenient to think of PyRanges objects as pandas DataFrames decorated with convenient methods for genomic analyses. As seen above, PyRanges offers an interface analogous to DataFrame for data access and input/output, and it is similar when printed. Also, the columns of both object types are pandas Series.
Yet, PyRanges is not implemented as a subclass of DataFrame, as we will see shortly, so that it does not offer all its methods.
When in need of a pandas functionality missing in PyRanges, you can easily obtain a DataFrame version of it with property df
(a shortcut for method as_df
).
Note that this copies all data: avoid it if you can stick to PyRanges functions.
>>> prom_in_cds.df
Chromosome Start End Strand ID Length
0 0001.1 320529 320532 + cds-CAD5110677.1 3
1 0001.1 371611 371744 + cds-CAD5110692.1 133
2 0001.1 371870 371911 + cds-CAD5110692.1 41
3 0001.1 434527 434725 + cds-CAD5110703.1 198
4 0001.1 445384 445477 + cds-CAD5110709.1 93
... ... ... ... ... ... ...
3035 0111.1 42092 42201 - cds-CAD5126964.1 109
3036 0114.1 4910 5045 - cds-CAD5126973.1 135
3037 0117.1 13016 13019 + cds-CAD5126983.1 3
3038 0117.1 13080 13144 + cds-CAD5126983.1 64
3039 0117.1 38911 38997 - cds-CAD5126989.1 86
[3040 rows x 6 columns]
Let’s use the DataFrame for a groupby
operation wherein we get the aggregated length per promoter of regions overlapping a CDS, as pandas Series:
>>> tot_len = prom_in_cds.df.groupby("ID").Length.sum()
>>> tot_len.name = 'Tot_length'
>>> tot_len
ID
cds-CAD5110617.1 73
cds-CAD5110618.1 235
cds-CAD5110619.1 235
cds-CAD5110622.1 73
cds-CAD5110623.1 73
...
cds-DGYR_LOCUS5056 14
cds-DGYR_LOCUS5675 41
cds-DGYR_LOCUS7571-2 234
cds-DGYR_LOCUS9062 4
cds-DGYR_LOCUS980 99
Name: Tot_length, Length: 2374, dtype: int64
Let’s add this new information (how much of a CDS promoter is overlapping a different CDS) to the cds
object.
Since it is one number per CDS, all intervals with the same ID will have the same Tot_length
. This operation corresponds to a database “join”,
which is missing from PyRanges functionalities but available as pandas merge
:
>>> z = cds.df.merge(tot_len, on='ID', how='left')
>>> z.Tot_length.fillna(0, inplace=True, downcast='infer')
>>> z
Chromosome Start End Strand ID Tot_length
0 0001.1 7327 7606 + cds-CAD5110615.1 0
1 0001.1 46377 47603 + cds-CAD5110625.1 0
2 0001.1 48406 49448 + cds-CAD5110625.1 0
3 0001.1 46377 47603 + cds-CAD5110626.1 0
4 0001.1 48406 48736 + cds-CAD5110626.1 0
... ... ... ... ... ... ...
100035 0117.1 20172 21261 - cds-CAD5126985.1 0
100036 0117.1 26816 27881 - cds-CAD5126988.1 0
100037 0117.1 36183 38697 - cds-CAD5126989.1 86
100038 0117.1 39309 39450 - cds-CAD5126990.1 0
100039 0117.1 38911 39256 - cds-CAD5126990.1 0
[100040 rows x 6 columns]
Only some CDSs have a promoter overlapping another CDS, so we used how=’left’ when calling merge
.
This retains all rows of cds
, introducing NaN values for IDs missing in tot_len. On the next line of code, we filled NaN with zeros.
Now let’s convert the resulting DataFrame z
back to PyRanges:
>>> cds = pr.PyRanges(z)
>>> cds
+--------------+-----------+-----------+--------------+------------------+--------------+
| Chromosome | Start | End | Strand | ID | Tot_length |
| (category) | (int64) | (int64) | (category) | (object) | (int64) |
|--------------+-----------+-----------+--------------+------------------+--------------|
| 0001.1 | 7327 | 7606 | + | cds-CAD5110615.1 | 0 |
| 0001.1 | 46377 | 47603 | + | cds-CAD5110625.1 | 0 |
| 0001.1 | 48406 | 49448 | + | cds-CAD5110625.1 | 0 |
| 0001.1 | 46377 | 47603 | + | cds-CAD5110626.1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 0117.1 | 26816 | 27881 | - | cds-CAD5126988.1 | 0 |
| 0117.1 | 36183 | 38697 | - | cds-CAD5126989.1 | 86 |
| 0117.1 | 39309 | 39450 | - | cds-CAD5126990.1 | 0 |
| 0117.1 | 38911 | 39256 | - | cds-CAD5126990.1 | 0 |
+--------------+-----------+-----------+--------------+------------------+--------------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
Now let’s dig into the differences of PyRanges and DataFrame.
Say that we want to order our intervals by Tot_length
. We use PyRanges method sort
:
>>> srt_cds = cds.sort('Tot_length')
>>> srt_cds
+--------------+-----------+-----------+--------------+------------------+--------------+
| Chromosome | Start | End | Strand | ID | Tot_length |
| (category) | (int64) | (int64) | (category) | (object) | (int64) |
|--------------+-----------+-----------+--------------+------------------+--------------|
| 0001.1 | 7327 | 7606 | + | cds-CAD5110615.1 | 0 |
| 0001.1 | 2092958 | 2093126 | + | cds-CAD5111233.1 | 0 |
| 0001.1 | 2093184 | 2093293 | + | cds-CAD5111233.1 | 0 |
| 0001.1 | 2093350 | 2093443 | + | cds-CAD5111233.1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 0117.1 | 26816 | 27881 | - | cds-CAD5126988.1 | 0 |
| 0117.1 | 39309 | 39450 | - | cds-CAD5126990.1 | 0 |
| 0117.1 | 38911 | 39256 | - | cds-CAD5126990.1 | 0 |
| 0117.1 | 36183 | 38697 | - | cds-CAD5126989.1 | 86 |
+--------------+-----------+-----------+--------------+------------------+--------------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
If we sort the analogous DataFrame using pandas sort_values
, we see the results does not quite look the same:
>>> cds.df.sort_values('Tot_length')
Chromosome Start End Strand ID Tot_length
0 0001.1 7327 7606 + cds-CAD5110615.1 0
64767 0013.1 1245868 1246381 - cds-CAD5121003.1 0
64766 0013.1 1246440 1246707 - cds-CAD5121003.1 0
64765 0013.1 1246768 1246871 - cds-CAD5121003.1 0
64764 0013.1 1246936 1247072 - cds-CAD5121003.1 0
... ... ... ... ... ... ...
9380 0002.1 258719 259098 - cds-CAD5111666.1 300
64403 0013.1 777119 777332 - cds-CAD5120886.1 300
9381 0002.1 258298 258496 - cds-CAD5111666.1 300
64404 0013.1 776667 776756 - cds-CAD5120886.1 300
48837 0009.1 731837 731964 + cds-CAD5118662.1 300
[100040 rows x 6 columns]
Why is that? Under the hood, each PyRanges object is a collection of DataFrames: data is spread into separate tables,
one for each chromosome/strand pair; e.g. there is a DataFrame with coordinates for + intervals on chr1, one for - intervals on chr1,
one for + on chr2, one for - on chr2, etc. The user typically does not need to directly access them (but if you do,
you can check the dictionary-type PyRanges attribute dfs
).
Intervals on different chromosomes have no order relative to each other, and they are never mixed up in the same table. Indeed, when inspecting a PyRanges object, you see the message:
For printing, the PyRanges was sorted on Chromosome and Strand.
PyRanges sort
therefore acts on each table independently. Pandas, on the other hand,
has no problems mixing up rows corresponding to different chromosomes, which explains the discrepancy seen above.
This leads to another important difference. In pandas, the index is an essential component of the DataFrame, providing the row labels, order, and a tool for data access. In PyRanges objects, there is no index. Their internal tables of course have their own indices, but they are purposely hidden from the user as they are not to be queried or relied upon.
The user should also beware of methods merge
and join
, which have different meanings.
As seen above, in Pandas they are slight variations of database “join”, while in PyRanges they refer to interval manipulation based on genomic overlap.
Method chaining and custom functions
Like Pandas, pyranges support method chaining.
Some useful methods in this sense are: pc
, which prints the object and returns it for further chaining;
subset
, which performs row selection; and assign, which adds a column.
We may chain these to obtain at once the CDS subset, add the length of intervals as new column, drop a couple of columns, then print before and after sorting by length:
>>> ( ann.subset(lambda x:x.Feature=='CDS')
... .assign('Length', lambda x:x.End-x.Start)
... .drop(['Parent', 'Feature'])
... .pc()
... .sort('Length')
... .print()
... )
+--------------+-----------+-----------+--------------+------------------+-----------+
| Chromosome | Start | End | Strand | ID | Length |
| (category) | (int64) | (int64) | (category) | (object) | (int64) |
|--------------+-----------+-----------+--------------+------------------+-----------|
| 0001.1 | 7327 | 7606 | + | cds-CAD5110615.1 | 279 |
| 0001.1 | 46377 | 47603 | + | cds-CAD5110625.1 | 1226 |
| 0001.1 | 48406 | 49448 | + | cds-CAD5110625.1 | 1042 |
| 0001.1 | 46377 | 47603 | + | cds-CAD5110626.1 | 1226 |
| ... | ... | ... | ... | ... | ... |
| 0117.1 | 26816 | 27881 | - | cds-CAD5126988.1 | 1065 |
| 0117.1 | 36183 | 38697 | - | cds-CAD5126989.1 | 2514 |
| 0117.1 | 39309 | 39450 | - | cds-CAD5126990.1 | 141 |
| 0117.1 | 38911 | 39256 | - | cds-CAD5126990.1 | 345 |
+--------------+-----------+-----------+--------------+------------------+-----------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
+--------------+-----------+-----------+--------------+------------------+-----------+
| Chromosome | Start | End | Strand | ID | Length |
| (category) | (int64) | (int64) | (category) | (object) | (int64) |
|--------------+-----------+-----------+--------------+------------------+-----------|
| 0001.1 | 2789047 | 2789050 | + | cds-CAD5111421.1 | 3 |
| 0001.1 | 909263 | 909266 | + | cds-CAD5110841.1 | 3 |
| 0001.1 | 1721311 | 1721314 | + | cds-CAD5111112.1 | 3 |
| 0001.1 | 651283 | 651286 | + | cds-CAD5110767.1 | 3 |
| ... | ... | ... | ... | ... | ... |
| 0117.1 | 38911 | 39256 | - | cds-CAD5126990.1 | 345 |
| 0117.1 | 26816 | 27881 | - | cds-CAD5126988.1 | 1065 |
| 0117.1 | 20172 | 21261 | - | cds-CAD5126985.1 | 1089 |
| 0117.1 | 36183 | 38697 | - | cds-CAD5126989.1 | 2514 |
+--------------+-----------+-----------+--------------+------------------+-----------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
With assign
and subset
, you provide a function which is applied to each DataFrame in the collection.
In both cases, it must return a Series with the same number of rows as the input PyRanges.
For subset
, it must be a boolean Series, which is used as row selector. Another similar method called apply
also exists.
Use apply
when your function returns a DataFrame which can be converted to a PyRanges object (i.e. containing Chromosome, Start, End, Strand columns).
In the following code, we select only CDS intervals whose center point is within 50 nucleotides from the middle of a chromosome.
For this, we precompute a table of chromosome sizes from the object pyf used earlier, and we use pandas merge
, through pyranges apply
,
to join this table with each dataframe.
>>> chromsizes = pd.DataFrame.from_dict(
... {'Chromosome':[k for k,v in pyf.items()],
... 'chromsize':[len(v) for k,v in pyf.items()]}
... )
>>> (ann.subset(lambda x:x.Feature=='CDS')
... .drop(['Parent', 'Feature', 'ID'])
... .apply(lambda x:x.merge(chromsizes, on='Chromosome'))
... .assign('midpoint', lambda x:(x.End+x.Start)/2)
... .subset(lambda x:abs(x.midpoint - x.chromsize/2)<50)
... )
+--------------+-----------+-----------+--------------+-------------+-------------+
| Chromosome | Start | End | Strand | chromsize | midpoint |
| (object) | (int64) | (int64) | (category) | (int64) | (float64) |
|--------------+-----------+-----------+--------------+-------------+-------------|
| 0003.1 | 1616314 | 1616368 | + | 3232594 | 1616341.0 |
| 0005.1 | 2273266 | 2273365 | + | 4546684 | 2273315.5 |
| 0013.1 | 861154 | 861506 | + | 1722566 | 861330.0 |
| 0014.1 | 1120851 | 1120981 | - | 2241898 | 1120916.0 |
| ... | ... | ... | ... | ... | ... |
| 0034.1 | 98757 | 98889 | - | 197648 | 98823.0 |
| 0057.1 | 47117 | 47335 | + | 94438 | 47226.0 |
| 0065.1 | 81326 | 81695 | + | 162924 | 81510.5 |
| 0098.1 | 63943 | 64035 | + | 127883 | 63989.0 |
+--------------+-----------+-----------+--------------+-------------+-------------+
Stranded PyRanges object has 12 rows and 6 columns from 10 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
A common operation in pandas is group by then apply, i.e. dividing the table in groups and performing certain operations on each group.
You can do such operations using subset
, assign
, or apply
, depending on what to do with the result.
Remember that the table of each Chromosome/Strand is processed independently.
Let’s use this functionality to get the first (5’-most) exon of each CDS group.
We use pyranges sort
with the argument ‘5’, which puts intervals in 5’ -> 3’ order, then we apply a groupby+apply pandas chain:
>>> ( ann.subset(lambda x:x.Feature=='CDS').drop(['Parent', 'Feature']).sort('5').apply(lambda x:x.groupby('ID', as_index=False).first()))
+------------------+--------------+-----------+-----------+--------------+
| ID | Chromosome | Start | End | Strand |
| (object) | (category) | (int64) | (int64) | (category) |
|------------------+--------------+-----------+-----------+--------------|
| cds-CAD5110615.1 | 0001.1 | 7327 | 7606 | + |
| cds-CAD5110625.1 | 0001.1 | 46377 | 47603 | + |
| cds-CAD5110626.1 | 0001.1 | 46377 | 47603 | + |
| cds-CAD5110627.1 | 0001.1 | 60099 | 60409 | + |
| ... | ... | ... | ... | ... |
| cds-CAD5126985.1 | 0117.1 | 20172 | 21261 | - |
| cds-CAD5126988.1 | 0117.1 | 26816 | 27881 | - |
| cds-CAD5126989.1 | 0117.1 | 36183 | 38697 | - |
| cds-CAD5126990.1 | 0117.1 | 39309 | 39450 | - |
+------------------+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 16,378 rows and 5 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
This concludes our tutorial. The next pages will delve into pyranges functionalities grouped by topic.