Working with sequences

Fetching sequences per interval 

A common operation is to fetch the sequences corresponding to the intervals represented in the PyRanges object. Let’s see an example with built-in data.

>>> import pyranges1 as pr
>>> genome_file = pr.example_data.files['ncbi.fasta']
>>> sg = pr.example_data.ncbi_gff
>>> sg = sg[sg.Feature == 'CDS'].get_with_loc_columns('ID')
>>> sg = sg.sort_ranges().head(20).copy()
>>> sg
index    |    Chromosome         Start    End      Strand      ID
int64    |    category           int64    int64    category    str
-------  ---  -----------------  -------  -------  ----------  ----------------
145      |    CAJFCJ010000025.1  3111     3153     -           cds-CAD5125114.1
146      |    CAJFCJ010000025.1  2753     2851     -           cds-CAD5125114.1
135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1
136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1
...      |    ...                ...      ...      ...         ...
41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1
42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1
43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1
44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1
PyRanges with 20 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.

Function get_sequence takes the path to a fasta file, and returns a Series containing those sequences, with the same index as the PyRanges object. It requires the package pyfaidx (see add-ons installation. get_sequence will treat each interval independently. The Series returned is typically used to create a new column in the PyRanges object:

>>> sg['Sequence'] = sg.get_sequence(genome_file)
>>> sg
index    |    Chromosome         Start    End      Strand      ID                ...
int64    |    category           int64    int64    category    str               ...
-------  ---  -----------------  -------  -------  ----------  ----------------  -----
145      |    CAJFCJ010000025.1  3111     3153     -           cds-CAD5125114.1  ...
146      |    CAJFCJ010000025.1  2753     2851     -           cds-CAD5125114.1  ...
135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1  ...
136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1  ...
...      |    ...                ...      ...      ...         ...               ...
41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1  ...
42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1  ...
43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1  ...
44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1  ...
PyRanges with 20 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
Contains 2 chromosomes and 2 strands.

>>> sg.Sequence.head()
         ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
  AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
                                                 at
  ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
  ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...

For intervals on the negative strand, the sequence has been reverse complemented.

The Sequence Series is amenable to string operations, using the pandas .str operator. Let’s convert it to uppercase:

>>> sg['Sequence'] = sg.Sequence.str.upper()
>>> sg.Sequence.head()
         ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
  AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
                                                 AT
  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...

Fetching sequences per mRNA 

Often we’re interested in the sequence of a transcript, i.e. the concatenation of the exons. Using the argument group_by in function get_sequence allows to get just that, by concatenating the sequences of the intervals in the PyRanges object in the correct order:

>>> mrna_seq = sg.get_sequence(genome_file, group_by='ID')
>>> mrna_seq 
ID
cds-CAD5125114.1    ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
cds-CAD5125115.1    atggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1    ATGAAGATTTTTGctataatatcaatatattttatattatctgAGT...
cds-CAD5126493.1    ATGAAtttctatagaaatttttttaatttaattttttgtattaaag...
cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1    atggtTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object

Again, let’s convert to uppercase:

>>> mrna_seq = mrna_seq.str.upper()
>>> mrna_seq 
ID
cds-CAD5125114.1    ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
cds-CAD5125115.1    ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1    ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
cds-CAD5126493.1    ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1    ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object

Filtering by sequence 

The Sequence column can be used to filter the PyRanges object. For example, let’s get intervals whose sequence starts with a G:

>>> g_sg= sg[sg.Sequence.str.startswith('G')]
>>> g_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
                   ID                                           Sequence
cds-CAD5125115.1  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
cds-CAD5125114.1  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
cds-CAD5125115.1  GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
cds-CAD5125114.1  GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
 cds-CAD5126492.1  GTCTAAAGGTTTTCGATACTTGTTTCAGTAAGTTTCATACATCAAA...
 cds-CAD5126492.1  GAATTCTTGGACCAACTACTAGTGCTATATCCGAAACTATATCCAC...
 cds-CAD5126496.1  GTTAAATGGTAAAAACTATGGTGGAAATCCTATACCTGAAAAAAGT...

We can use a regular expression as a filter. Let’s get those with CC, followed by 1 to 3 characters, then AA:

>>> pat_sg = sg[sg.Sequence.str.contains(r'CCC.{1,3}AA', regex=True)]
>>> pat_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
                  ID                                           Sequence
14  cds-CAD5126492.1  TCTCTTGCTTCATTAATGGAAAAATTTGGCTGGAATTTTATTCTAA...
43  cds-CAD5126496.1  TTCCAAATGATGATGCTACCCATAAAATATGTCTCTTTGAAACATT...
44  cds-CAD5126496.1  TGTTTCTGTTGTTGATCCCACTAAGGCATCGGTAGATCTTACTGGC...

Let’s check the mRNA sequence instead. Let’s get the sequences with TTTT in their first 200 nucleotides:

>>> z = mrna_seq[mrna_seq.str[:200].str.contains('TTTT')]
>>> z 
ID
cds-CAD5125115.1    ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
cds-CAD5126492.1    ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
cds-CAD5126493.1    ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
cds-CAD5126496.1    ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
Name: Sequence, dtype: object

Now let’s filter the original PyRanges object to get all interval groups with these IDs:

>>> zsg = sg[sg.ID.isin(z.index)]
>>> zsg
index    |    Chromosome         Start    End      Strand      ID                ...
int64    |    category           int64    int64    category    str               ...
-------  ---  -----------------  -------  -------  ----------  ----------------  -----
135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1  ...
136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1  ...
137      |    CAJFCJ010000025.1  2354     2537     -           cds-CAD5125115.1  ...
138      |    CAJFCJ010000025.1  2174     2294     -           cds-CAD5125115.1  ...
...      |    ...                ...      ...      ...         ...               ...
41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1  ...
42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1  ...
43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1  ...
44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1  ...
PyRanges with 15 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
Contains 2 chromosomes and 2 strands.

Translation and reverse complement 

The submodule pyranges1.seqs contains functions to translate sequences and reverse complement them. They can operate on a single sequence, or a Series of sequences as below:

>>> sg['Protein'] = pr.seqs.translate(sg.Sequence)
>>> sg['RevComp'] = pr.seqs.reverse_complement(sg.Sequence)
>>> sg[ ['ID', 'Sequence', 'Protein', 'RevComp'] ].head(1).transpose()  # to allow display
                                                 145
ID                                  cds-CAD5125114.1
Sequence  ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
Protein                               MSRQSGRSNDPRKV
RevComp   TACTTTTCGAGGATCATTTGATCTCCCACTCTGTCGCGACAT

Working with sequences

Fetching sequences per interval

Fetching sequences per mRNA

Filtering by sequence

Translation and reverse complement

Fetching sequences per interval 

Fetching sequences per mRNA 

Filtering by sequence 

Translation and reverse complement 