Working with sequences

Fetching sequences per interval

A common operation is to fetch the sequences corresponding to the intervals represented in the PyRanges object. Let’s see an example with built-in data.

>>> import pyranges1 as pr
>>> genome_file = pr.example_data.files['ncbi.fasta']
>>> sg = pr.example_data.ncbi_gff
>>> sg = sg[sg.Feature == 'CDS'].get_with_loc_columns('ID')
>>> sg = sg.sort_ranges().head(20).copy()
>>> sg
index    |    Chromosome         Start    End      Strand      ID
int64    |    category           int64    int64    category    str
-------  ---  -----------------  -------  -------  ----------  ----------------
145      |    CAJFCJ010000025.1  3111     3153     -           cds-CAD5125114.1
146      |    CAJFCJ010000025.1  2753     2851     -           cds-CAD5125114.1
135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1
136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1
...      |    ...                ...      ...      ...         ...
41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1
42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1
43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1
44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1
PyRanges with 20 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.

Function get_sequence takes the path to a fasta file, and returns a Series containing those sequences, with the same index as the PyRanges object. It requires the package pyfaidx (see add-ons installation. get_sequence will treat each interval independently. The Series returned is typically used to create a new column in the PyRanges object:

>>> sg['Sequence'] = sg.get_sequence(genome_file)
>>> sg
index    |    Chromosome         Start    End      Strand      ID                ...
int64    |    category           int64    int64    category    str               ...
-------  ---  -----------------  -------  -------  ----------  ----------------  -----
145      |    CAJFCJ010000025.1  3111     3153     -           cds-CAD5125114.1  ...
146      |    CAJFCJ010000025.1  2753     2851     -           cds-CAD5125114.1  ...
135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1  ...
136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1  ...
...      |    ...                ...      ...      ...         ...               ...
41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1  ...
42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1  ...
43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1  ...
44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1  ...
PyRanges with 20 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
Contains 2 chromosomes and 2 strands.
>>> sg.Sequence.head()
145           ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
146    AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
135                                                   at
136    ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
147    ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...

For intervals on the negative strand, the sequence has been reverse complemented.

The Sequence Series is amenable to string operations, using the pandas .str operator. Let’s convert it to uppercase:

>>> sg['Sequence'] = sg.Sequence.str.upper()
>>> sg.Sequence.head()
145           ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
146    AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
135                                                   AT
136    GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
147    GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...

Fetching sequences per mRNA

Often we’re interested in the sequence of a transcript, i.e. the concatenation of the exons. Using the argument group_by in function get_sequence allows to get just that, by concatenating the sequences of the intervals in the PyRanges object in the correct order:

>>> mrna_seq = sg.get_sequence(genome_file, group_by='ID')
>>> mrna_seq 
ID
cds-CAD5125114.1    ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
cds-CAD5125115.1    atggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1    ATGAAGATTTTTGctataatatcaatatattttatattatctgAGT...
cds-CAD5126493.1    ATGAAtttctatagaaatttttttaatttaattttttgtattaaag...
cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1    atggtTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object

Again, let’s convert to uppercase:

>>> mrna_seq = mrna_seq.str.upper()
>>> mrna_seq 
ID
cds-CAD5125114.1    ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
cds-CAD5125115.1    ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1    ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
cds-CAD5126493.1    ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1    ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object

Filtering by sequence

The Sequence column can be used to filter the PyRanges object. For example, let’s get intervals whose sequence starts with a G:

>>> g_sg= sg[sg.Sequence.str.startswith('G')]
>>> g_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
                   ID                                           Sequence
136  cds-CAD5125115.1  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
147  cds-CAD5125114.1  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
138  cds-CAD5125115.1  GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
149  cds-CAD5125114.1  GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
12   cds-CAD5126492.1  GTCTAAAGGTTTTCGATACTTGTTTCAGTAAGTTTCATACATCAAA...
13   cds-CAD5126492.1  GAATTCTTGGACCAACTACTAGTGCTATATCCGAAACTATATCCAC...
42   cds-CAD5126496.1  GTTAAATGGTAAAAACTATGGTGGAAATCCTATACCTGAAAAAAGT...

We can use a regular expression as a filter. Let’s get those with CC, followed by 1 to 3 characters, then AA:

>>> pat_sg = sg[sg.Sequence.str.contains(r'CCC.{1,3}AA', regex=True)]
>>> pat_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
                  ID                                           Sequence
14  cds-CAD5126492.1  TCTCTTGCTTCATTAATGGAAAAATTTGGCTGGAATTTTATTCTAA...
43  cds-CAD5126496.1  TTCCAAATGATGATGCTACCCATAAAATATGTCTCTTTGAAACATT...
44  cds-CAD5126496.1  TGTTTCTGTTGTTGATCCCACTAAGGCATCGGTAGATCTTACTGGC...

Let’s check the mRNA sequence instead. Let’s get the sequences with TTTT in their first 200 nucleotides:

>>> z = mrna_seq[mrna_seq.str[:200].str.contains('TTTT')]
>>> z 
ID
cds-CAD5125115.1    ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1    ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
cds-CAD5126493.1    ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1    ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object

Now let’s filter the original PyRanges object to get all interval groups with these IDs:

>>> zsg = sg[sg.ID.isin(z.index)]
>>> zsg
index    |    Chromosome         Start    End      Strand      ID                ...
int64    |    category           int64    int64    category    str               ...
-------  ---  -----------------  -------  -------  ----------  ----------------  -----
135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1  ...
136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1  ...
137      |    CAJFCJ010000025.1  2354     2537     -           cds-CAD5125115.1  ...
138      |    CAJFCJ010000025.1  2174     2294     -           cds-CAD5125115.1  ...
...      |    ...                ...      ...      ...         ...               ...
41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1  ...
42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1  ...
43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1  ...
44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1  ...
PyRanges with 15 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
Contains 2 chromosomes and 2 strands.

Translation and reverse complement

The submodule pyranges1.seqs contains functions to translate sequences and reverse complement them. They can operate on a single sequence, or a Series of sequences as below:

>>> sg['Protein'] = pr.seqs.translate(sg.Sequence)
>>> sg['RevComp'] = pr.seqs.reverse_complement(sg.Sequence)
>>> sg[ ['ID', 'Sequence', 'Protein', 'RevComp'] ].head(1).transpose()  # to allow display
                                                 145
ID                                  cds-CAD5125114.1
Sequence  ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
Protein                               MSRQSGRSNDPRKV
RevComp   TACTTTTCGAGGATCATTTGATCTCCCACTCTGTCGCGACAT