Working with sequences
Fetching sequences per interval
A common operation is to fetch the sequences corresponding to the intervals represented in the PyRanges object. Let’s see an example with built-in data.
>>> import pyranges1 as pr
>>> genome_file = pr.example_data.files['ncbi.fasta']
>>> sg = pr.example_data.ncbi_gff
>>> sg = sg[sg.Feature == 'CDS'].get_with_loc_columns('ID')
>>> sg = sg.sort_ranges().head(20).copy()
>>> sg
index | Chromosome Start End Strand ID
int64 | category int64 int64 category str
------- --- ----------------- ------- ------- ---------- ----------------
145 | CAJFCJ010000025.1 3111 3153 - cds-CAD5125114.1
146 | CAJFCJ010000025.1 2753 2851 - cds-CAD5125114.1
135 | CAJFCJ010000025.1 2753 2755 - cds-CAD5125115.1
136 | CAJFCJ010000025.1 2593 2693 - cds-CAD5125115.1
... | ... ... ... ... ...
41 | CAJFCJ010000053.1 77280 77395 + cds-CAD5126496.1
42 | CAJFCJ010000053.1 77458 77655 + cds-CAD5126496.1
43 | CAJFCJ010000053.1 77724 77839 + cds-CAD5126496.1
44 | CAJFCJ010000053.1 77899 78770 + cds-CAD5126496.1
PyRanges with 20 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
Function get_sequence
takes the path to a fasta file, and returns a Series containing those sequences, with the same index as the PyRanges object.
It requires the package pyfaidx (see add-ons installation.
get_sequence will treat each interval independently. The Series returned
is typically used to create a new column in the PyRanges object:
>>> sg['Sequence'] = sg.get_sequence(genome_file)
>>> sg
index | Chromosome Start End Strand ID ...
int64 | category int64 int64 category str ...
------- --- ----------------- ------- ------- ---------- ---------------- -----
145 | CAJFCJ010000025.1 3111 3153 - cds-CAD5125114.1 ...
146 | CAJFCJ010000025.1 2753 2851 - cds-CAD5125114.1 ...
135 | CAJFCJ010000025.1 2753 2755 - cds-CAD5125115.1 ...
136 | CAJFCJ010000025.1 2593 2693 - cds-CAD5125115.1 ...
... | ... ... ... ... ... ...
41 | CAJFCJ010000053.1 77280 77395 + cds-CAD5126496.1 ...
42 | CAJFCJ010000053.1 77458 77655 + cds-CAD5126496.1 ...
43 | CAJFCJ010000053.1 77724 77839 + cds-CAD5126496.1 ...
44 | CAJFCJ010000053.1 77899 78770 + cds-CAD5126496.1 ...
PyRanges with 20 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
Contains 2 chromosomes and 2 strands.
>>> sg.Sequence.head()
145 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
146 AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
135 at
136 ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
147 ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
For intervals on the negative strand, the sequence has been reverse complemented.
The Sequence Series is amenable to string operations, using the pandas .str operator.
Let’s convert it to uppercase:
>>> sg['Sequence'] = sg.Sequence.str.upper()
>>> sg.Sequence.head()
145 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
146 AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
135 AT
136 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
147 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
Fetching sequences per mRNA
Often we’re interested in the sequence of a transcript, i.e. the concatenation of the exons.
Using the argument group_by in function get_sequence allows to get just
that, by concatenating the sequences of the intervals in the PyRanges object in the correct order:
>>> mrna_seq = sg.get_sequence(genome_file, group_by='ID')
>>> mrna_seq
ID
cds-CAD5125114.1 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
cds-CAD5125115.1 atggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1 ATGAAGATTTTTGctataatatcaatatattttatattatctgAGT...
cds-CAD5126493.1 ATGAAtttctatagaaatttttttaatttaattttttgtattaaag...
cds-CAD5126495.1 ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1 atggtTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object
Again, let’s convert to uppercase:
>>> mrna_seq = mrna_seq.str.upper()
>>> mrna_seq
ID
cds-CAD5125114.1 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
cds-CAD5125115.1 ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1 ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
cds-CAD5126493.1 ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
cds-CAD5126495.1 ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1 ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object
Filtering by sequence
The Sequence column can be used to filter the PyRanges object.
For example, let’s get intervals whose sequence starts with a G:
>>> g_sg= sg[sg.Sequence.str.startswith('G')]
>>> g_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
ID Sequence
136 cds-CAD5125115.1 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
147 cds-CAD5125114.1 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
138 cds-CAD5125115.1 GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
149 cds-CAD5125114.1 GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
12 cds-CAD5126492.1 GTCTAAAGGTTTTCGATACTTGTTTCAGTAAGTTTCATACATCAAA...
13 cds-CAD5126492.1 GAATTCTTGGACCAACTACTAGTGCTATATCCGAAACTATATCCAC...
42 cds-CAD5126496.1 GTTAAATGGTAAAAACTATGGTGGAAATCCTATACCTGAAAAAAGT...
We can use a regular expression as a filter. Let’s get those with CC, followed by 1 to 3 characters, then AA:
>>> pat_sg = sg[sg.Sequence.str.contains(r'CCC.{1,3}AA', regex=True)]
>>> pat_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
ID Sequence
14 cds-CAD5126492.1 TCTCTTGCTTCATTAATGGAAAAATTTGGCTGGAATTTTATTCTAA...
43 cds-CAD5126496.1 TTCCAAATGATGATGCTACCCATAAAATATGTCTCTTTGAAACATT...
44 cds-CAD5126496.1 TGTTTCTGTTGTTGATCCCACTAAGGCATCGGTAGATCTTACTGGC...
Let’s check the mRNA sequence instead. Let’s get the sequences with TTTT in their first 200 nucleotides:
>>> z = mrna_seq[mrna_seq.str[:200].str.contains('TTTT')]
>>> z
ID
cds-CAD5125115.1 ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1 ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
cds-CAD5126493.1 ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
cds-CAD5126495.1 ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1 ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object
Now let’s filter the original PyRanges object to get all interval groups with these IDs:
>>> zsg = sg[sg.ID.isin(z.index)]
>>> zsg
index | Chromosome Start End Strand ID ...
int64 | category int64 int64 category str ...
------- --- ----------------- ------- ------- ---------- ---------------- -----
135 | CAJFCJ010000025.1 2753 2755 - cds-CAD5125115.1 ...
136 | CAJFCJ010000025.1 2593 2693 - cds-CAD5125115.1 ...
137 | CAJFCJ010000025.1 2354 2537 - cds-CAD5125115.1 ...
138 | CAJFCJ010000025.1 2174 2294 - cds-CAD5125115.1 ...
... | ... ... ... ... ... ...
41 | CAJFCJ010000053.1 77280 77395 + cds-CAD5126496.1 ...
42 | CAJFCJ010000053.1 77458 77655 + cds-CAD5126496.1 ...
43 | CAJFCJ010000053.1 77724 77839 + cds-CAD5126496.1 ...
44 | CAJFCJ010000053.1 77899 78770 + cds-CAD5126496.1 ...
PyRanges with 15 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
Contains 2 chromosomes and 2 strands.
Translation and reverse complement
The submodule pyranges1.seqs contains functions to translate sequences and reverse complement them.
They can operate on a single sequence, or a Series of sequences as below:
>>> sg['Protein'] = pr.seqs.translate(sg.Sequence)
>>> sg['RevComp'] = pr.seqs.reverse_complement(sg.Sequence)
>>> sg[ ['ID', 'Sequence', 'Protein', 'RevComp'] ].head(1).transpose() # to allow display
145
ID cds-CAD5125114.1
Sequence ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
Protein MSRQSGRSNDPRKV
RevComp TACTTTTCGAGGATCATTTGATCTCCCACTCTGTCGCGACAT