Tutorial

PyRanges objects represent sequence intervals (“ranges”) such as genomic regions. Examples include gene annotations, mapped reads, protein domains, and results from Blast or analogous software. PyRanges provides convenient methods to load data in GFF, GTF, BED, BAM format. In this tutorial, we refer to PyRanges objects for their typical use case, i.e. representing genomic intervals, but the same concepts and methods may be applied to RNA or protein sequences.

Every interval in a PyRanges object is defined, at minimum, by its chromosome and start and end coordinates. Optionally, the strand (+ or -) can also be present; if so, the PyRanges object is “Stranded”. The ranges can additionally have an arbitrary number of meta-data fields, i.e. columns associated with them.

Note that PyRanges follows the standard python convention:

  • coordinates are 0-based

  • in each interval, the start is included and the end is excluded.

Some genomic coordinate formats (GFF, GTF) use a 1-based, start-and-end-included format. PyRanges takes care of converting between these conventions when loading and writing files in these formats. The data in PyRanges objects are stored as Pandas dataframes. This means that the vast Python ecosystem for high-performance scientific computing is available to manipulate the data in PyRanges objects. While not strictly necessary, having familiarity with pandas greatly facilitates the use of PyRanges. In this tutorial and in how-to pages, we will sometimes mix PyRanges and pandas functionalities.

Getting started

For this tutorial, we will use real data consisting of the genome sequence (fasta format) and annotation (GFF3 format) of a worm with a very small but complex genome (Dimorphilus gyrociliatus, assembly ID GCA_904063045.1). These data were modified for this tutorial only to shorten IDs.

Download and unpack tutorial data with:

curl -O https://mariottigenomicslab.bio.ub.edu/pyranges_data/pyranges_tutorial_data.tar.gz
tar zxf pyranges_tutorial_data.tar.gz

We recommend using ipython or Jupyter for this tutorial. Besides pyranges and some of its dependencies, we will use the optional module pyfaidx. Make sure to install it with:

pip install pyfaidx

Or with:

conda install -c bioconda pyfaidx

Loading and accessing pyranges objects

Let’s import libraries and load the annotation into a PyRanges object called ann:

>>> import pyranges as pr, pandas as pd
>>> ann = pr.read_gff3('Dgyro_annotation.gff')
>>> ann
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+--------------------+---------------+------------+-------------+-------+
| Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | ID                 | Dbxref        | gbkey      | mol_type    | +13   |
| (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)           | (object)      | (object)   | (object)    | ...   |
|--------------+------------+--------------+-----------+-----------+------------+--------------+------------+--------------------+---------------+------------+-------------+-------|
| 0001.1       | EMBL       | region       | 0         | 3609319   | .          | +            | .          | 0001.1:1..3609319  | taxon:2664684 | Src        | genomic DNA | ...   |
| 0001.1       | EMBL       | gene         | 7326      | 7606      | .          | +            | .          | gene-DGYR_LOCUS1   | nan           | Gene       | nan         | ...   |
| 0001.1       | EMBL       | mRNA         | 7326      | 7606      | .          | +            | .          | rna-DGYR_LOCUS1    | nan           | mRNA       | nan         | ...   |
| 0001.1       | EMBL       | exon         | 7326      | 7606      | .          | +            | .          | exon-DGYR_LOCUS1-1 | nan           | mRNA       | nan         | ...   |
| ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...                | ...           | ...        | ...         | ...   |
| 0346.1       | EMBL       | region       | 0         | 1411      | .          | +            | .          | 0346.1:1..1411     | taxon:2664684 | Src        | genomic DNA | ...   |
| 0347.1       | EMBL       | region       | 0         | 1094      | .          | +            | .          | 0347.1:1..1094     | taxon:2664684 | Src        | genomic DNA | ...   |
| 0348.1       | EMBL       | region       | 0         | 667       | .          | +            | .          | 0348.1:1..667      | taxon:2664684 | Src        | genomic DNA | ...   |
| 0349.1       | EMBL       | region       | 0         | 641       | .          | +            | .          | 0349.1:1..641      | taxon:2664684 | Src        | genomic DNA | ...   |
+--------------+------------+--------------+-----------+-----------+------------+--------------+------------+--------------------+---------------+------------+-------------+-------+
Stranded PyRanges object has 241,137 rows and 25 columns from 349 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
13 hidden columns: note, Name, gene_biotype, locus_tag, Parent, Note, standard_name, product, protein_id, pseudo, partial, start_range, end_range

The ann object has lots of columns, most of which are not displayed because of space. Here’s the full list:

>>> ann.columns
Index(['Chromosome', 'Source', 'Feature', 'Start', 'End', 'Score', 'Strand',
       'Frame', 'ID', 'Dbxref', 'gbkey', 'mol_type', 'note', 'Name',
       'gene_biotype', 'locus_tag', 'Parent', 'Note', 'standard_name',
       'product', 'protein_id', 'pseudo', 'partial', 'start_range',
       'end_range'],
      dtype='object')

Let’s select only certain columns:

>>> ann = ann[ ['Feature', 'Parent', 'ID'] ]
>>> ann
+--------------+--------------+-----------+-----------+--------------+------------------+--------------------+
| Chromosome   | Feature      | Start     | End       | Strand       | Parent           | ID                 |
| (category)   | (category)   | (int64)   | (int64)   | (category)   | (object)         | (object)           |
|--------------+--------------+-----------+-----------+--------------+------------------+--------------------|
| 0001.1       | region       | 0         | 3609319   | +            | nan              | 0001.1:1..3609319  |
| 0001.1       | gene         | 7326      | 7606      | +            | nan              | gene-DGYR_LOCUS1   |
| 0001.1       | mRNA         | 7326      | 7606      | +            | gene-DGYR_LOCUS1 | rna-DGYR_LOCUS1    |
| 0001.1       | exon         | 7326      | 7606      | +            | rna-DGYR_LOCUS1  | exon-DGYR_LOCUS1-1 |
| ...          | ...          | ...       | ...       | ...          | ...              | ...                |
| 0346.1       | region       | 0         | 1411      | +            | nan              | 0346.1:1..1411     |
| 0347.1       | region       | 0         | 1094      | +            | nan              | 0347.1:1..1094     |
| 0348.1       | region       | 0         | 667       | +            | nan              | 0348.1:1..667      |
| 0349.1       | region       | 0         | 641       | +            | nan              | 0349.1:1..641      |
+--------------+--------------+-----------+-----------+--------------+------------------+--------------------+
Stranded PyRanges object has 241,137 rows and 7 columns from 349 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

As seen above, column selection syntax is analogous to pandas. However, a difference is that PyRanges retained the essential columns Chromosome, Start, End, Strand even though we did not select them.

The Chromosome column can take any value among the sequence names in the genome assembly. Only for the best quality assemblies it corresponds to actual chromosomes, and in other cases it is contigs or scaffolds; for simplicity, here we refer to it as chromosomes. In a fasta file, the sequence name is the first word of a header line (i.e. those starting with “>”). We can have a peek in the assembly in bash:

grep ">" Dgyro_genome.fa | head
>0001.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold001, whole genome shotgun sequence
>0002.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold002, whole genome shotgun sequence
>0003.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold003, whole genome shotgun sequence
>0004.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold004, whole genome shotgun sequence
>0005.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold005, whole genome shotgun sequence
>0006.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold006, whole genome shotgun sequence
>0007.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold007, whole genome shotgun sequence
>0008.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold008, whole genome shotgun sequence
>0009.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold009, whole genome shotgun sequence
>0010.1 Dimorphilus gyrociliatus genome assembly, contig: scaffold010, whole genome shotgun sequence

Genomic annotations often contain information for diverse entities, such as genes, mRNAs, exons, CDS, etc. In GFF files, the entity type is encoded in the Feature column. In pyranges, you use the dot notation to fetch an individual column, which is technically a pandas Series:

>>> ann.Feature
0    region
1      gene
2      mRNA
3      exon
4       CDS
      ...
0    region
0    region
0    region
0    region
0    region
Name: Feature, Length: 241137, dtype: category
Categories (6, object): ['CDS', 'exon', 'gene', 'mRNA', 'pseudogene', 'region']

Let’s focus on a subset of the annotation: CDS intervals, corresponding to coding sequences. We filter rows and create a new PyRanges object called cds:

>>> selector = (ann.Feature == 'CDS')
>>> cds = ann [selector]

We used another syntax familiar to pandas users. The object selector is a Series of boolean values, so it can be used to index PyRanges.

Let’s further reduce the width of the cds object. We showcase an alternative method for column selection: the method drop lets us choose which columns to discard.

>>> cds = cds.drop( ['Feature', 'Parent'] )
>>> cds
+--------------+-----------+-----------+--------------+------------------+
| Chromosome   | Start     | End       | Strand       | ID               |
| (category)   | (int64)   | (int64)   | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1       | 7327      | 7606      | +            | cds-CAD5110615.1 |
| 0001.1       | 46377     | 47603     | +            | cds-CAD5110625.1 |
| 0001.1       | 48406     | 49448     | +            | cds-CAD5110625.1 |
| 0001.1       | 46377     | 47603     | +            | cds-CAD5110626.1 |
| ...          | ...       | ...       | ...          | ...              |
| 0117.1       | 26816     | 27881     | -            | cds-CAD5126988.1 |
| 0117.1       | 36183     | 38697     | -            | cds-CAD5126989.1 |
| 0117.1       | 39309     | 39450     | -            | cds-CAD5126990.1 |
| 0117.1       | 38911     | 39256     | -            | cds-CAD5126990.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 100,040 rows and 5 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The interface shown so far is analogous to pandas. Additionally, pyranges offers a non-pandas syntax for selecting intervals in a genomic region of interest (i.e. region retrieval). The code below will show only intervals completely included in the specified position range in the requested chromosome:

>>> cds['0002.1', 145000:150000]
+--------------+-----------+-----------+--------------+------------------+
|   Chromosome |     Start |       End | Strand       | ID               |
|   (category) |   (int64) |   (int64) | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
|          2.1 |    146123 |    146246 | +            | cds-CAD5111630.1 |
|          2.1 |    146306 |    147077 | +            | cds-CAD5111630.1 |
|          2.1 |    147830 |    147976 | +            | cds-CAD5111631.1 |
|          2.1 |    148191 |    148297 | +            | cds-CAD5111631.1 |
|          2.1 |    148360 |    148489 | +            | cds-CAD5111631.1 |
|          2.1 |    145002 |    145116 | -            | cds-CAD5111629.1 |
|          2.1 |    145176 |    145262 | -            | cds-CAD5111629.1 |
|          2.1 |    145320 |    145435 | -            | cds-CAD5111629.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The syntax for region retrieval may consists of:

  • chromosome

  • chromosome, position slice (as the example above)

  • chromosome, strand, position slice

So, for example, this is also valid:

>>> cds['0002.1', "-", 145000:150000]
+--------------+-----------+-----------+--------------+------------------+
|   Chromosome |     Start |       End | Strand       | ID               |
|   (category) |   (int64) |   (int64) | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
|          2.1 |    145002 |    145116 | -            | cds-CAD5111629.1 |
|          2.1 |    145176 |    145262 | -            | cds-CAD5111629.1 |
|          2.1 |    145320 |    145435 | -            | cds-CAD5111629.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

It is important to differentiate between Stranded and Unstranded PyRanges objects. When a Strand column is present and all its values are “+” or “-”, the object is Stranded. When there are invalid values (e.g. “.”) or the Strand column is absent, it is Unstranded. You can check whether an interval is Stranded with:

>>> cds.stranded
True

Certain pyranges methods require a Stranded input. While the annotation used in this tutorial is naturally Stranded, others may not be. If necessary, you may use method make_stranded to transform all invalid Strand values to “+” or remove them.

Working with groups of exons

Multi-exonic genes are represented with multiple rows in PyRanges. In this tutorial, the ID column links the intervals belonging to the same CDS: these rows have the same ID value. While this concept applies to all annotations, files from different sources may use different column names for this purpose (e.g. transcript_id). Note that here we focus on CDS regions. These may encompass multiple exons, but they do not span the whole mRNA: the 5’UTRs and 3’UTRs are not included.

Next, we will examine the first and last codon of annotated CDSs. We will obtain their genomic coordinate, then fetch their sequence.

Method spliced_subsequence allows to obtain a subregion of groups of intervals. The code below derives the first codon of each CDS group (grouping is defined by their ID):

>>> first=cds.spliced_subsequence(start=0, end=3, by='ID')
>>> first
+--------------+-----------+-----------+--------------+------------------+
| Chromosome   | Start     | End       | Strand       | ID               |
| (category)   | (int64)   | (int64)   | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1       | 7327      | 7330      | +            | cds-CAD5110615.1 |
| 0001.1       | 46377     | 46380     | +            | cds-CAD5110625.1 |
| 0001.1       | 46377     | 46380     | +            | cds-CAD5110626.1 |
| 0001.1       | 60099     | 60102     | +            | cds-CAD5110627.1 |
| ...          | ...       | ...       | ...          | ...              |
| 0117.1       | 38694     | 38697     | -            | cds-CAD5126989.1 |
| 0117.1       | 27878     | 27881     | -            | cds-CAD5126988.1 |
| 0117.1       | 21258     | 21261     | -            | cds-CAD5126985.1 |
| 0117.1       | 14274     | 14277     | -            | cds-CAD5126984.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 16,391 rows and 5 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Let’s fetch the sequence for each of these intervals from our genome fasta file. The function get_sequence returns one sequence per interval, which we assign to a new column of our pyranges object:

>>> first.Sequence = pr.get_sequence(first, 'Dgyro_genome.fa')
>>> first
+--------------+-----------+-----------+--------------+------------------+------------+
| Chromosome   | Start     | End       | Strand       | ID               | Sequence   |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (object)   |
|--------------+-----------+-----------+--------------+------------------+------------|
| 0001.1       | 7327      | 7330      | +            | cds-CAD5110615.1 | ATG        |
| 0001.1       | 46377     | 46380     | +            | cds-CAD5110625.1 | ATG        |
| 0001.1       | 46377     | 46380     | +            | cds-CAD5110626.1 | ATG        |
| 0001.1       | 60099     | 60102     | +            | cds-CAD5110627.1 | ATG        |
| ...          | ...       | ...       | ...          | ...              | ...        |
| 0117.1       | 38694     | 38697     | -            | cds-CAD5126989.1 | ATT        |
| 0117.1       | 27878     | 27881     | -            | cds-CAD5126988.1 | ATG        |
| 0117.1       | 21258     | 21261     | -            | cds-CAD5126985.1 | ATG        |
| 0117.1       | 14274     | 14277     | -            | cds-CAD5126984.1 | ATG        |
+--------------+-----------+-----------+--------------+------------------+------------+
Stranded PyRanges object has 16,391 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The Sequence column is a pandas Series containing strings. We see that the starting codon is ATG in most cases, as expected. When we check the length of the sequences, we notice that some are not 3-letter long:

>>> (first.Sequence.str.len() == 3 ).all()
False

Let’s look at those sequences, using a row selector as before:

>>> first [ first.Sequence.str.len() != 3 ]
+--------------+-----------+-----------+--------------+------------------+------------+
| Chromosome   | Start     | End       | Strand       | ID               | Sequence   |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (object)   |
|--------------+-----------+-----------+--------------+------------------+------------|
| 0001.1       | 667699    | 667700    | -            | cds-CAD5110773.1 | a          |
| 0001.1       | 667641    | 667643    | -            | cds-CAD5110773.1 | TG         |
| 0002.1       | 2632107   | 2632109   | -            | cds-CAD5112293.1 | AT         |
| 0002.1       | 2631440   | 2631441   | -            | cds-CAD5112293.1 | G          |
| ...          | ...       | ...       | ...          | ...              | ...        |
| 0024.1       | 1091339   | 1091341   | -            | cds-CAD5125104.1 | AT         |
| 0024.1       | 1091163   | 1091164   | -            | cds-CAD5125104.1 | G          |
| 0025.1       | 39753     | 39755     | -            | cds-CAD5125115.1 | at         |
| 0025.1       | 39692     | 39693     | -            | cds-CAD5125115.1 | g          |
+--------------+-----------+-----------+--------------+------------------+------------+
Stranded PyRanges object has 26 rows and 6 columns from 11 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

In some cases the starting codon is split between two exons. This is uncommon, but we are looking at all protein coding genes of a species, so it is expected at least in a few cases. How do we get the full codon sequence?

Instead of get_sequence, let’s use get_transcript_sequence, which returns the concatenated sequence of a group of intervals, i.e. joining exons together. The sequence is given 5’ to 3’.

>>> seq_first = pr.get_transcript_sequence(
...      first,
...      group_by='ID',
...      path='Dgyro_genome.fa'
...     )
>>> seq_first
                       ID Sequence
0        cds-CAD5110614.1      ATG
1        cds-CAD5110615.1      ATG
2        cds-CAD5110616.1      atg
3        cds-CAD5110617.1      atg
4        cds-CAD5110618.1      ATG
...                   ...      ...
16373  cds-DGYR_LOCUS9540      atg
16374  cds-DGYR_LOCUS9596      ATG
16375  cds-DGYR_LOCUS9732      ATG
16376   cds-DGYR_LOCUS980      ATG
16377  cds-DGYR_LOCUS9980      ATG

[16378 rows x 2 columns]

seq_first is not a PyRanges object, but a pandas DataFrame. It has a column for the group (ID) and one for the sequence. Here we confirm the sequence length is always 3:

>>> (seq_first.Sequence.str.len()==3).all()
True

Finally, let’s quantify how many start codons are ATG, using a bit of pandas magic. First, we make sure the whole sequence is in uppercase characters. Then, we make a boolean Series is_atg which has True corresponding to the ATG codons, then we sum its values to count the instances of True, creating variable n_atg. We also store the IDs of the CDSs with ATG codons in the variable is_atg_ids. Finally, we print a summary:

>>> seq_first.Sequence = seq_first.Sequence.str.upper()
>>> is_atg = (seq_first.Sequence == 'ATG')
>>> is_atg_ids = seq_first[is_atg].ID
>>> n_atg = is_atg.sum()
>>> print(f'There are {n_atg} ATG start codons out of '
... f'{len(seq_first)} CDSs => {n_atg/len(seq_first):.2%}')
There are 16339 ATG start codons out of 16378 CDSs => 99.76%

Now, we want to perform an analogous analysis with stop codons. First, we get the a pyranges object of the last codon of each CDS. Conveniently, the method spliced_subsequence accepts negative arguments to count from the 3’, so we can obtain the last three nucleotides of CDSs with:

>>> last = cds.spliced_subsequence(start=-3, by='ID')

By not providing an end argument, we requested intervals that reach the very end of each CDS group. Let’s get their sequence as before, then use pandas function value_counts to count them:

>>> seq_last = pr.get_transcript_sequence(last, 'ID',
...      'Dgyro_genome.fa')
>>> seq_last.Sequence = seq_last.Sequence.str.upper()
>>> seq_last.Sequence.value_counts()
TAA    8986
TGA    3859
TAG    3488
AAA       4
CTT       3
AAG       3
TTA       3
AGG       2
GAG       2
ATA       2
GTT       2
CGA       2
ACA       2
TAT       2
CCC       1
AGT       1
TCA       1
TTC       1
TTT       1
AGC       1
CAA       1
TTG       1
AAT       1
GTG       1
CCA       1
AAC       1
TCT       1
GGC       1
GAA       1
CAG       1
ATG       1
CTG       1
Name: Sequence, dtype: int64

The canonical stop codons account for the great majority of cases, but there are some other values. This may warrant a further look into these CDSs. In this tutorial, we’ll simply exclude them from our next steps.

Let’s gather the IDs of CDSs with a canonical stop, to be used further on:

>>> is_stop = seq_last.Sequence.isin( {'TAG', 'TAA', 'TGA'} )
>>> is_stop_ids = seq_last[is_stop].ID

Writing coordinates and sequences to the disk

We want to get a “clean” annotation consisting only of canonical CDSs, with an ATG starting codon and a TAA/TAG/TGA stop codon. First, we put together the IDs of CDSs with these characteristics:

>>> clean_ids = set(is_atg_ids).intersection(set(is_stop_ids))

Then we subset the ann pyranges object:

>>> clean_ann = ann[ann.ID.isin(clean_ids)]
>>> clean_ann
+--------------+--------------+-----------+-----------+--------------+---------------------+------------------+
| Chromosome   | Feature      | Start     | End       | Strand       | Parent              | ID               |
| (category)   | (category)   | (int64)   | (int64)   | (category)   | (object)            | (object)         |
|--------------+--------------+-----------+-----------+--------------+---------------------+------------------|
| 0001.1       | CDS          | 7327      | 7606      | +            | rna-DGYR_LOCUS1     | cds-CAD5110615.1 |
| 0001.1       | CDS          | 46377     | 47603     | +            | rna-DGYR_LOCUS4     | cds-CAD5110625.1 |
| 0001.1       | CDS          | 48406     | 49448     | +            | rna-DGYR_LOCUS4     | cds-CAD5110625.1 |
| 0001.1       | CDS          | 46377     | 47603     | +            | rna-DGYR_LOCUS4-2   | cds-CAD5110626.1 |
| ...          | ...          | ...       | ...       | ...          | ...                 | ...              |
| 0117.1       | CDS          | 20172     | 21261     | -            | rna-DGYR_LOCUS14199 | cds-CAD5126985.1 |
| 0117.1       | CDS          | 26816     | 27881     | -            | rna-DGYR_LOCUS14201 | cds-CAD5126988.1 |
| 0117.1       | CDS          | 39309     | 39450     | -            | rna-DGYR_LOCUS14203 | cds-CAD5126990.1 |
| 0117.1       | CDS          | 38911     | 39256     | -            | rna-DGYR_LOCUS14203 | cds-CAD5126990.1 |
+--------------+--------------+-----------+-----------+--------------+---------------------+------------------+
Stranded PyRanges object has 99,155 rows and 7 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

We can now write this pyranges object to a file, for example in GTF format:

>>> clean_ann.to_gtf('Dgyro_annotation.canonical_CDS.gtf')

Let’s get the sequence for the canonical CDSs and write it to a tabular file.

>>> clean_ann_seq = pr.get_transcript_sequence(clean_ann, 'ID',
...           'Dgyro_genome.fa')
>>> clean_ann_seq.to_csv('Dgyro_canonical_CDS.seq.tsv',
...                sep='\t', index=False)

Note that clean_ann_seq is a pandas DataFrame. To write sequences in fasta format we use:

>>> with open('Dgyro_canonical_CDS.fa', 'w') as fw: 
...   for xin, xid, xseq in clean_ann_seq.itertuples():
...     fw.write(f'>{xid}\n{xseq}\n')

Extending genomic intervals

Now we want to obtain (a practical approximation of) promoter sequences, here defined as the 300bp region before the start codon. Before we begin, let’s peek into our object cds:

>>> cds.head()
+--------------+-----------+-----------+--------------+------------------+
|   Chromosome |     Start |       End | Strand       | ID               |
|   (category) |   (int64) |   (int64) | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
|          1.1 |      7327 |      7606 | +            | cds-CAD5110615.1 |
|          1.1 |     46377 |     47603 | +            | cds-CAD5110625.1 |
|          1.1 |     48406 |     49448 | +            | cds-CAD5110625.1 |
|          1.1 |     46377 |     47603 | +            | cds-CAD5110626.1 |
|          1.1 |     48406 |     48736 | +            | cds-CAD5110626.1 |
|          1.1 |     48839 |     48912 | +            | cds-CAD5110626.1 |
|          1.1 |     60099 |     60409 | +            | cds-CAD5110627.1 |
|          1.1 |     60476 |     60515 | +            | cds-CAD5110627.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

First, we use the method extend to obtain intervals which include the CDS and the promoter defined as above:

>>> g = cds.extend({'5':300}, group_by='ID')
>>> g.head()
+--------------+-----------+-----------+--------------+------------------+
|   Chromosome |     Start |       End | Strand       | ID               |
|   (category) |   (int64) |   (int64) | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
|          1.1 |      7027 |      7606 | +            | cds-CAD5110615.1 |
|          1.1 |     46077 |     47603 | +            | cds-CAD5110625.1 |
|          1.1 |     48406 |     49448 | +            | cds-CAD5110625.1 |
|          1.1 |     46077 |     47603 | +            | cds-CAD5110626.1 |
|          1.1 |     48406 |     48736 | +            | cds-CAD5110626.1 |
|          1.1 |     48839 |     48912 | +            | cds-CAD5110626.1 |
|          1.1 |     59799 |     60409 | +            | cds-CAD5110627.1 |
|          1.1 |     60476 |     60515 | +            | cds-CAD5110627.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The first argument ensures that the 300bp extension is applied only at the 5’ (left side for + strand intervals, right side for - strand intervals). Through the group_by argument, we request one extension per CDS, instead of extending every interval. In the object we obtained, the promoter corresponds to the first 300 bp of every interval group. We can use method spliced_subsequence again to get it:

>>> prom = g.spliced_subsequence(0, 300, 'ID')
>>> prom.head()
+--------------+-----------+-----------+--------------+------------------+
|   Chromosome |     Start |       End | Strand       | ID               |
|   (category) |   (int64) |   (int64) | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
|          1.1 |      7027 |      7327 | +            | cds-CAD5110615.1 |
|          1.1 |     46077 |     46377 | +            | cds-CAD5110625.1 |
|          1.1 |     46077 |     46377 | +            | cds-CAD5110626.1 |
|          1.1 |     59799 |     60099 | +            | cds-CAD5110627.1 |
|          1.1 |     59799 |     60099 | +            | cds-CAD5110628.1 |
|          1.1 |     72099 |     72399 | +            | cds-CAD5110629.1 |
|          1.1 |     75736 |     76036 | +            | cds-CAD5110630.1 |
|          1.1 |     79997 |     80297 | +            | cds-CAD5110631.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 8 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Because we extended intervals, some may have gone out-of-bounds on the left or on the right side: they may have a Start smaller than 0, or an End greater than the length of its chromosome, respectively. Indeed, we see there are cases of the first type:

>>> prom[prom.Start<0]
+--------------+-----------+-----------+--------------+---------------------+
|   Chromosome |     Start |       End | Strand       | ID                  |
|   (category) |   (int64) |   (int64) | (category)   | (object)            |
|--------------+-----------+-----------+--------------+---------------------|
|         12.1 |      -129 |       171 | +            | cds-DGYR_LOCUS8189  |
|         18.1 |       -12 |       288 | +            | cds-DGYR_LOCUS10365 |
|         48.1 |      -118 |       182 | +            | cds-CAD5126431.1    |
|         78.1 |      -299 |         1 | +            | cds-CAD5126743.1    |
+--------------+-----------+-----------+--------------+---------------------+
Stranded PyRanges object has 4 rows and 5 columns from 4 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

The function genome_bounds in submodule genomicfeatures is designed to correct this. We may use it to remove out-of-bounds intervals, or to retain only their in-bound portions. We go for the second option, with clip=True:

>>> import pyfaidx
>>> pyf=pyfaidx.Fasta('Dgyro_genome.fa')
>>> cor_prom = pr.genomicfeatures.genome_bounds(prom,
...               chromsizes=pyf,
...               clip=True)

To detect cases of out-of-bounds on the right side, function genome_bounds needs to know chromosome sizes. Various input types are accepted for the chromsizes argument; we used a pyfaidx.Fasta object, which derives it from a fasta file.

The intervals above (and also the right-side out-of-bounds, though we don’t inspect them) have been corrected:

>>> outofbounds_left=prom[prom.Start<0].ID
>>> cor_prom[cor_prom.ID.isin(outofbounds_left)]
+--------------+-----------+-----------+--------------+---------------------+
|   Chromosome |     Start |       End | Strand       | ID                  |
|   (category) |   (int64) |   (int64) | (category)   | (object)            |
|--------------+-----------+-----------+--------------+---------------------|
|         12.1 |         0 |       171 | +            | cds-DGYR_LOCUS8189  |
|         18.1 |         0 |       288 | +            | cds-DGYR_LOCUS10365 |
|         48.1 |         0 |       182 | +            | cds-CAD5126431.1    |
|         78.1 |         0 |         1 | +            | cds-CAD5126743.1    |
+--------------+-----------+-----------+--------------+---------------------+
Stranded PyRanges object has 4 rows and 5 columns from 4 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Detecting overlaps among intervals

Let’s see if any of the promoter regions overlap other CDSs. Pyranges offers many efficient methods to detect overlaps, such as overlap:

>>> cor_prom.overlap(cds, strandedness=False)
+--------------+-----------+-----------+--------------+------------------+
| Chromosome   | Start     | End       | Strand       | ID               |
| (category)   | (int64)   | (int64)   | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1       | 320232    | 320532    | +            | cds-CAD5110677.1 |
| 0001.1       | 371611    | 371911    | +            | cds-CAD5110692.1 |
| 0001.1       | 434425    | 434725    | +            | cds-CAD5110703.1 |
| 0001.1       | 445177    | 445477    | +            | cds-CAD5110709.1 |
| ...          | ...       | ...       | ...          | ...              |
| 0111.1       | 42019     | 42319     | -            | cds-CAD5126964.1 |
| 0114.1       | 4745      | 5045      | -            | cds-CAD5126973.1 |
| 0117.1       | 12844     | 13144     | +            | cds-CAD5126983.1 |
| 0117.1       | 38697     | 38997     | -            | cds-CAD5126989.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 2,374 rows and 5 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

By default, this method reports intervals in the self pyranges object (i.e., cor_prom) that have at least 1bp of overlap with the other pyranges (i.e., cds). By invoking strandedness=False, we included overlaps even between intervals on opposite strands.

There are many promoters overlapping CDSs. Let’s get the overlapping regions only, using function intersect:

>>> prom_in_cds = cor_prom.intersect(cds, strandedness=False)
>>> prom_in_cds
+--------------+-----------+-----------+--------------+------------------+
| Chromosome   | Start     | End       | Strand       | ID               |
| (category)   | (int64)   | (int64)   | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1       | 320529    | 320532    | +            | cds-CAD5110677.1 |
| 0001.1       | 371611    | 371744    | +            | cds-CAD5110692.1 |
| 0001.1       | 371611    | 371744    | +            | cds-CAD5110692.1 |
| 0001.1       | 371870    | 371911    | +            | cds-CAD5110692.1 |
| ...          | ...       | ...       | ...          | ...              |
| 0114.1       | 4910      | 5045      | -            | cds-CAD5126973.1 |
| 0117.1       | 13016     | 13019     | +            | cds-CAD5126983.1 |
| 0117.1       | 13080     | 13144     | +            | cds-CAD5126983.1 |
| 0117.1       | 38911     | 38997     | -            | cds-CAD5126989.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 4,185 rows and 5 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

intersect returned more rows than overlap. This is because an interval in cor_prom may overlap multiple intervals in cds, potentially with different intersection regions (compare the 2nd, 3rd and 4th rows, which are all subregions of the same promoter). Therefore, intersect returns one row for each pair of overlapping intervals, while overlap always returns a subset of rows from the self pyranges object, unaltered.

We want to remove redundancy in the object above. We use the method merge to fuse intervals that have some overlap (and the same ID value):

>>> prom_in_cds = prom_in_cds.merge(by='ID')
>>> prom_in_cds
+--------------+-----------+-----------+--------------+------------------+
| Chromosome   | Start     | End       | Strand       | ID               |
| (category)   | (int64)   | (int64)   | (category)   | (object)         |
|--------------+-----------+-----------+--------------+------------------|
| 0001.1       | 320529    | 320532    | +            | cds-CAD5110677.1 |
| 0001.1       | 371611    | 371744    | +            | cds-CAD5110692.1 |
| 0001.1       | 371870    | 371911    | +            | cds-CAD5110692.1 |
| 0001.1       | 434527    | 434725    | +            | cds-CAD5110703.1 |
| ...          | ...       | ...       | ...          | ...              |
| 0114.1       | 4910      | 5045      | -            | cds-CAD5126973.1 |
| 0117.1       | 13016     | 13019     | +            | cds-CAD5126983.1 |
| 0117.1       | 13080     | 13144     | +            | cds-CAD5126983.1 |
| 0117.1       | 38911     | 38997     | -            | cds-CAD5126989.1 |
+--------------+-----------+-----------+--------------+------------------+
Stranded PyRanges object has 3,040 rows and 5 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Note that with overlap or intersect, we do not keep track of the coordinates of overlapping intervals in the second object (cds), as we only obtain those in the first object (cor_prom). For that task, check methods cluster and join (not shown here).

We now want to calculate how long the overlapping regions are. We create a new column named Length by subtracting Start from End. This operation is performed element-wise (the 1st value of End minus the 1st value of Start, the 2nd End minus the 2nd Start, etc), a common paradigm of pandas Series.

>>> prom_in_cds.Length = prom_in_cds.End - prom_in_cds.Start
>>> prom_in_cds
+--------------+-----------+-----------+--------------+------------------+-----------+
| Chromosome   | Start     | End       | Strand       | ID               | Length    |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (int64)   |
|--------------+-----------+-----------+--------------+------------------+-----------|
| 0001.1       | 320529    | 320532    | +            | cds-CAD5110677.1 | 3         |
| 0001.1       | 371611    | 371744    | +            | cds-CAD5110692.1 | 133       |
| 0001.1       | 371870    | 371911    | +            | cds-CAD5110692.1 | 41        |
| 0001.1       | 434527    | 434725    | +            | cds-CAD5110703.1 | 198       |
| ...          | ...       | ...       | ...          | ...              | ...       |
| 0114.1       | 4910      | 5045      | -            | cds-CAD5126973.1 | 135       |
| 0117.1       | 13016     | 13019     | +            | cds-CAD5126983.1 | 3         |
| 0117.1       | 13080     | 13144     | +            | cds-CAD5126983.1 | 64        |
| 0117.1       | 38911     | 38997     | -            | cds-CAD5126989.1 | 86        |
+--------------+-----------+-----------+--------------+------------------+-----------+
Stranded PyRanges object has 3,040 rows and 6 columns from 61 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Pandas vs Pyranges

It is convenient to think of PyRanges objects as pandas DataFrames decorated with convenient methods for genomic analyses. As seen above, PyRanges offers an interface analogous to DataFrame for data access and input/output, and it is similar when printed. Also, the columns of both object types are pandas Series.

Yet, PyRanges is not implemented as a subclass of DataFrame, as we will see shortly, so that it does not offer all its methods. When in need of a pandas functionality missing in PyRanges, you can easily obtain a DataFrame version of it with property df (a shortcut for method as_df). Note that this copies all data: avoid it if you can stick to PyRanges functions.

>>> prom_in_cds.df
     Chromosome   Start     End Strand                ID  Length
0        0001.1  320529  320532      +  cds-CAD5110677.1       3
1        0001.1  371611  371744      +  cds-CAD5110692.1     133
2        0001.1  371870  371911      +  cds-CAD5110692.1      41
3        0001.1  434527  434725      +  cds-CAD5110703.1     198
4        0001.1  445384  445477      +  cds-CAD5110709.1      93
...         ...     ...     ...    ...               ...     ...
3035     0111.1   42092   42201      -  cds-CAD5126964.1     109
3036     0114.1    4910    5045      -  cds-CAD5126973.1     135
3037     0117.1   13016   13019      +  cds-CAD5126983.1       3
3038     0117.1   13080   13144      +  cds-CAD5126983.1      64
3039     0117.1   38911   38997      -  cds-CAD5126989.1      86

[3040 rows x 6 columns]

Let’s use the DataFrame for a groupby operation wherein we get the aggregated length per promoter of regions overlapping a CDS, as pandas Series:

>>> tot_len = prom_in_cds.df.groupby("ID").Length.sum()
>>> tot_len.name = 'Tot_length'
>>> tot_len
ID
cds-CAD5110617.1         73
cds-CAD5110618.1        235
cds-CAD5110619.1        235
cds-CAD5110622.1         73
cds-CAD5110623.1         73
                       ...
cds-DGYR_LOCUS5056       14
cds-DGYR_LOCUS5675       41
cds-DGYR_LOCUS7571-2    234
cds-DGYR_LOCUS9062        4
cds-DGYR_LOCUS980        99
Name: Tot_length, Length: 2374, dtype: int64

Let’s add this new information (how much of a CDS promoter is overlapping a different CDS) to the cds object. Since it is one number per CDS, all intervals with the same ID will have the same Tot_length. This operation corresponds to a database “join”, which is missing from PyRanges functionalities but available as pandas merge:

>>> z = cds.df.merge(tot_len, on='ID', how='left')
>>> z.Tot_length.fillna(0, inplace=True, downcast='infer')
>>> z
       Chromosome  Start    End Strand                ID  Tot_length
0          0001.1   7327   7606      +  cds-CAD5110615.1           0
1          0001.1  46377  47603      +  cds-CAD5110625.1           0
2          0001.1  48406  49448      +  cds-CAD5110625.1           0
3          0001.1  46377  47603      +  cds-CAD5110626.1           0
4          0001.1  48406  48736      +  cds-CAD5110626.1           0
...           ...    ...    ...    ...               ...         ...
100035     0117.1  20172  21261      -  cds-CAD5126985.1           0
100036     0117.1  26816  27881      -  cds-CAD5126988.1           0
100037     0117.1  36183  38697      -  cds-CAD5126989.1          86
100038     0117.1  39309  39450      -  cds-CAD5126990.1           0
100039     0117.1  38911  39256      -  cds-CAD5126990.1           0

[100040 rows x 6 columns]

Only some CDSs have a promoter overlapping another CDS, so we used how=’left’ when calling merge. This retains all rows of cds, introducing NaN values for IDs missing in tot_len. On the next line of code, we filled NaN with zeros.

Now let’s convert the resulting DataFrame z back to PyRanges:

>>> cds = pr.PyRanges(z)
>>> cds
+--------------+-----------+-----------+--------------+------------------+--------------+
| Chromosome   | Start     | End       | Strand       | ID               | Tot_length   |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (int64)      |
|--------------+-----------+-----------+--------------+------------------+--------------|
| 0001.1       | 7327      | 7606      | +            | cds-CAD5110615.1 | 0            |
| 0001.1       | 46377     | 47603     | +            | cds-CAD5110625.1 | 0            |
| 0001.1       | 48406     | 49448     | +            | cds-CAD5110625.1 | 0            |
| 0001.1       | 46377     | 47603     | +            | cds-CAD5110626.1 | 0            |
| ...          | ...       | ...       | ...          | ...              | ...          |
| 0117.1       | 26816     | 27881     | -            | cds-CAD5126988.1 | 0            |
| 0117.1       | 36183     | 38697     | -            | cds-CAD5126989.1 | 86           |
| 0117.1       | 39309     | 39450     | -            | cds-CAD5126990.1 | 0            |
| 0117.1       | 38911     | 39256     | -            | cds-CAD5126990.1 | 0            |
+--------------+-----------+-----------+--------------+------------------+--------------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Now let’s dig into the differences of PyRanges and DataFrame. Say that we want to order our intervals by Tot_length. We use PyRanges method sort:

>>> srt_cds = cds.sort('Tot_length')
>>> srt_cds
+--------------+-----------+-----------+--------------+------------------+--------------+
| Chromosome   | Start     | End       | Strand       | ID               | Tot_length   |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (int64)      |
|--------------+-----------+-----------+--------------+------------------+--------------|
| 0001.1       | 7327      | 7606      | +            | cds-CAD5110615.1 | 0            |
| 0001.1       | 2092958   | 2093126   | +            | cds-CAD5111233.1 | 0            |
| 0001.1       | 2093184   | 2093293   | +            | cds-CAD5111233.1 | 0            |
| 0001.1       | 2093350   | 2093443   | +            | cds-CAD5111233.1 | 0            |
| ...          | ...       | ...       | ...          | ...              | ...          |
| 0117.1       | 26816     | 27881     | -            | cds-CAD5126988.1 | 0            |
| 0117.1       | 39309     | 39450     | -            | cds-CAD5126990.1 | 0            |
| 0117.1       | 38911     | 39256     | -            | cds-CAD5126990.1 | 0            |
| 0117.1       | 36183     | 38697     | -            | cds-CAD5126989.1 | 86           |
+--------------+-----------+-----------+--------------+------------------+--------------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

If we sort the analogous DataFrame using pandas sort_values, we see the results does not quite look the same:

>>> cds.df.sort_values('Tot_length')
      Chromosome    Start      End Strand                ID  Tot_length
0         0001.1     7327     7606      +  cds-CAD5110615.1           0
64767     0013.1  1245868  1246381      -  cds-CAD5121003.1           0
64766     0013.1  1246440  1246707      -  cds-CAD5121003.1           0
64765     0013.1  1246768  1246871      -  cds-CAD5121003.1           0
64764     0013.1  1246936  1247072      -  cds-CAD5121003.1           0
...          ...      ...      ...    ...               ...         ...
9380      0002.1   258719   259098      -  cds-CAD5111666.1         300
64403     0013.1   777119   777332      -  cds-CAD5120886.1         300
9381      0002.1   258298   258496      -  cds-CAD5111666.1         300
64404     0013.1   776667   776756      -  cds-CAD5120886.1         300
48837     0009.1   731837   731964      +  cds-CAD5118662.1         300

[100040 rows x 6 columns]

Why is that? Under the hood, each PyRanges object is a collection of DataFrames: data is spread into separate tables, one for each chromosome/strand pair; e.g. there is a DataFrame with coordinates for + intervals on chr1, one for - intervals on chr1, one for + on chr2, one for - on chr2, etc. The user typically does not need to directly access them (but if you do, you can check the dictionary-type PyRanges attribute dfs).

Intervals on different chromosomes have no order relative to each other, and they are never mixed up in the same table. Indeed, when inspecting a PyRanges object, you see the message:

For printing, the PyRanges was sorted on Chromosome and Strand.

PyRanges sort therefore acts on each table independently. Pandas, on the other hand, has no problems mixing up rows corresponding to different chromosomes, which explains the discrepancy seen above.

This leads to another important difference. In pandas, the index is an essential component of the DataFrame, providing the row labels, order, and a tool for data access. In PyRanges objects, there is no index. Their internal tables of course have their own indices, but they are purposely hidden from the user as they are not to be queried or relied upon.

The user should also beware of methods merge and join, which have different meanings. As seen above, in Pandas they are slight variations of database “join”, while in PyRanges they refer to interval manipulation based on genomic overlap.

Method chaining and custom functions

Like Pandas, pyranges support method chaining. Some useful methods in this sense are: pc, which prints the object and returns it for further chaining; subset, which performs row selection; and assign, which adds a column.

We may chain these to obtain at once the CDS subset, add the length of intervals as new column, drop a couple of columns, then print before and after sorting by length:

>>> ( ann.subset(lambda x:x.Feature=='CDS')
... .assign('Length', lambda x:x.End-x.Start)
... .drop(['Parent', 'Feature'])
... .pc()
... .sort('Length')
... .print()
... )
+--------------+-----------+-----------+--------------+------------------+-----------+
| Chromosome   | Start     | End       | Strand       | ID               | Length    |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (int64)   |
|--------------+-----------+-----------+--------------+------------------+-----------|
| 0001.1       | 7327      | 7606      | +            | cds-CAD5110615.1 | 279       |
| 0001.1       | 46377     | 47603     | +            | cds-CAD5110625.1 | 1226      |
| 0001.1       | 48406     | 49448     | +            | cds-CAD5110625.1 | 1042      |
| 0001.1       | 46377     | 47603     | +            | cds-CAD5110626.1 | 1226      |
| ...          | ...       | ...       | ...          | ...              | ...       |
| 0117.1       | 26816     | 27881     | -            | cds-CAD5126988.1 | 1065      |
| 0117.1       | 36183     | 38697     | -            | cds-CAD5126989.1 | 2514      |
| 0117.1       | 39309     | 39450     | -            | cds-CAD5126990.1 | 141       |
| 0117.1       | 38911     | 39256     | -            | cds-CAD5126990.1 | 345       |
+--------------+-----------+-----------+--------------+------------------+-----------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
+--------------+-----------+-----------+--------------+------------------+-----------+
| Chromosome   | Start     | End       | Strand       | ID               | Length    |
| (category)   | (int64)   | (int64)   | (category)   | (object)         | (int64)   |
|--------------+-----------+-----------+--------------+------------------+-----------|
| 0001.1       | 2789047   | 2789050   | +            | cds-CAD5111421.1 | 3         |
| 0001.1       | 909263    | 909266    | +            | cds-CAD5110841.1 | 3         |
| 0001.1       | 1721311   | 1721314   | +            | cds-CAD5111112.1 | 3         |
| 0001.1       | 651283    | 651286    | +            | cds-CAD5110767.1 | 3         |
| ...          | ...       | ...       | ...          | ...              | ...       |
| 0117.1       | 38911     | 39256     | -            | cds-CAD5126990.1 | 345       |
| 0117.1       | 26816     | 27881     | -            | cds-CAD5126988.1 | 1065      |
| 0117.1       | 20172     | 21261     | -            | cds-CAD5126985.1 | 1089      |
| 0117.1       | 36183     | 38697     | -            | cds-CAD5126989.1 | 2514      |
+--------------+-----------+-----------+--------------+------------------+-----------+
Stranded PyRanges object has 100,040 rows and 6 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

With assign and subset, you provide a function which is applied to each DataFrame in the collection. In both cases, it must return a Series with the same number of rows as the input PyRanges. For subset, it must be a boolean Series, which is used as row selector. Another similar method called apply also exists. Use apply when your function returns a DataFrame which can be converted to a PyRanges object (i.e. containing Chromosome, Start, End, Strand columns).

In the following code, we select only CDS intervals whose center point is within 50 nucleotides from the middle of a chromosome. For this, we precompute a table of chromosome sizes from the object pyf used earlier, and we use pandas merge, through pyranges apply, to join this table with each dataframe.

>>> chromsizes = pd.DataFrame.from_dict(
... {'Chromosome':[k for k,v in pyf.items()],
...  'chromsize':[len(v) for k,v in pyf.items()]}
...    )
>>> (ann.subset(lambda x:x.Feature=='CDS')
... .drop(['Parent', 'Feature', 'ID'])
... .apply(lambda x:x.merge(chromsizes, on='Chromosome'))
... .assign('midpoint', lambda x:(x.End+x.Start)/2)
... .subset(lambda x:abs(x.midpoint - x.chromsize/2)<50)
...  )
+--------------+-----------+-----------+--------------+-------------+-------------+
| Chromosome   | Start     | End       | Strand       | chromsize   | midpoint    |
| (object)     | (int64)   | (int64)   | (category)   | (int64)     | (float64)   |
|--------------+-----------+-----------+--------------+-------------+-------------|
| 0003.1       | 1616314   | 1616368   | +            | 3232594     | 1616341.0   |
| 0005.1       | 2273266   | 2273365   | +            | 4546684     | 2273315.5   |
| 0013.1       | 861154    | 861506    | +            | 1722566     | 861330.0    |
| 0014.1       | 1120851   | 1120981   | -            | 2241898     | 1120916.0   |
| ...          | ...       | ...       | ...          | ...         | ...         |
| 0034.1       | 98757     | 98889     | -            | 197648      | 98823.0     |
| 0057.1       | 47117     | 47335     | +            | 94438       | 47226.0     |
| 0065.1       | 81326     | 81695     | +            | 162924      | 81510.5     |
| 0098.1       | 63943     | 64035     | +            | 127883      | 63989.0     |
+--------------+-----------+-----------+--------------+-------------+-------------+
Stranded PyRanges object has 12 rows and 6 columns from 10 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

A common operation in pandas is group by then apply, i.e. dividing the table in groups and performing certain operations on each group. You can do such operations using subset, assign, or apply, depending on what to do with the result. Remember that the table of each Chromosome/Strand is processed independently.

Let’s use this functionality to get the first (5’-most) exon of each CDS group. We use pyranges sort with the argument ‘5’, which puts intervals in 5’ -> 3’ order, then we apply a groupby+apply pandas chain:

>>> ( ann.subset(lambda x:x.Feature=='CDS').drop(['Parent', 'Feature']).sort('5').apply(lambda x:x.groupby('ID', as_index=False).first()))
+------------------+--------------+-----------+-----------+--------------+
| ID               | Chromosome   | Start     | End       | Strand       |
| (object)         | (category)   | (int64)   | (int64)   | (category)   |
|------------------+--------------+-----------+-----------+--------------|
| cds-CAD5110615.1 | 0001.1       | 7327      | 7606      | +            |
| cds-CAD5110625.1 | 0001.1       | 46377     | 47603     | +            |
| cds-CAD5110626.1 | 0001.1       | 46377     | 47603     | +            |
| cds-CAD5110627.1 | 0001.1       | 60099     | 60409     | +            |
| ...              | ...          | ...       | ...       | ...          |
| cds-CAD5126985.1 | 0117.1       | 20172     | 21261     | -            |
| cds-CAD5126988.1 | 0117.1       | 26816     | 27881     | -            |
| cds-CAD5126989.1 | 0117.1       | 36183     | 38697     | -            |
| cds-CAD5126990.1 | 0117.1       | 39309     | 39450     | -            |
+------------------+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 16,378 rows and 5 columns from 117 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

This concludes our tutorial. The next pages will delve into pyranges functionalities grouped by topic.