Working with sequences
~~~~~~~~~~~~~~~~~~~~~~

.. contents::
   :local:
   :depth: 2

Fetching sequences per interval
-------------------------------

A common operation is to fetch the sequences corresponding to the intervals
represented in the PyRanges object. Let's see an example with built-in data.

  >>> import pyranges1 as pr
  >>> genome_file = pr.example_data.files['ncbi.fasta']
  >>> sg = pr.example_data.ncbi_gff
  >>> sg = sg[sg.Feature == 'CDS'].get_with_loc_columns('ID')
  >>> sg = sg.sort_ranges().head(20).copy()
  >>> sg
  index    |    Chromosome         Start    End      Strand      ID
  int64    |    category           int64    int64    category    str
  -------  ---  -----------------  -------  -------  ----------  ----------------
  145      |    CAJFCJ010000025.1  3111     3153     -           cds-CAD5125114.1
  146      |    CAJFCJ010000025.1  2753     2851     -           cds-CAD5125114.1
  135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1
  136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1
  ...      |    ...                ...      ...      ...         ...
  41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1
  42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1
  43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1
  44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 2 chromosomes and 2 strands.

Function :func:`get_sequence <pyranges1.PyRanges.get_sequence>`
takes the path to a fasta file, and returns a Series containing those sequences, with the same index as the PyRanges object.
It requires the package pyfaidx (see :doc:`add-ons installation <./installation>`.
:func:`get_sequence <pyranges1.PyRanges.get_sequence>` will treat each interval independently. The Series returned
is typically used to create a new column in the PyRanges object:

  >>> sg['Sequence'] = sg.get_sequence(genome_file)
  >>> sg
  index    |    Chromosome         Start    End      Strand      ID                ...
  int64    |    category           int64    int64    category    str               ...
  -------  ---  -----------------  -------  -------  ----------  ----------------  -----
  145      |    CAJFCJ010000025.1  3111     3153     -           cds-CAD5125114.1  ...
  146      |    CAJFCJ010000025.1  2753     2851     -           cds-CAD5125114.1  ...
  135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1  ...
  136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1  ...
  ...      |    ...                ...      ...      ...         ...               ...
  41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1  ...
  42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1  ...
  43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1  ...
  44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1  ...
  PyRanges with 20 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
  Contains 2 chromosomes and 2 strands.

  >>> sg.Sequence.head()
  145           ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
  146    AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
  135                                                   at
  136    ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
  147    ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...

For intervals on the negative strand, the sequence has been reverse complemented.

The ``Sequence`` Series is amenable to string operations, using the pandas ``.str`` operator.
Let's convert it to uppercase:

  >>> sg['Sequence'] = sg.Sequence.str.upper()
  >>> sg.Sequence.head()
  145           ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
  146    AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC...
  135                                                   AT
  136    GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
  147    GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...

Fetching sequences per mRNA
---------------------------

Often we're interested in the sequence of a transcript, i.e. the concatenation of the exons.
Using the argument ``group_by`` in function :func:`get_sequence <pyranges1.PyRanges.get_sequence>` allows to get just
that, by concatenating the sequences of the intervals in the PyRanges object in the correct order:

  >>> mrna_seq = sg.get_sequence(genome_file, group_by='ID')
  >>> mrna_seq # doctest: +NORMALIZE_WHITESPACE
  ID
  cds-CAD5125114.1    ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
  cds-CAD5125115.1    atggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
  cds-CAD5126492.1    ATGAAGATTTTTGctataatatcaatatattttatattatctgAGT...
  cds-CAD5126493.1    ATGAAtttctatagaaatttttttaatttaattttttgtattaaag...
  cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
  cds-CAD5126496.1    atggtTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
  Name: Sequence, dtype: object

Again, let's convert to uppercase:

  >>> mrna_seq = mrna_seq.str.upper()
  >>> mrna_seq # doctest: +NORMALIZE_WHITESPACE
  ID
  cds-CAD5125114.1    ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG...
  cds-CAD5125115.1    ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
  cds-CAD5126492.1    ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
  cds-CAD5126493.1    ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
  cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
  cds-CAD5126496.1    ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
  Name: Sequence, dtype: object

Filtering by sequence
---------------------

The ``Sequence`` column can be used to filter the PyRanges object.
For example, let's get intervals whose sequence starts with a G:

  >>> g_sg= sg[sg.Sequence.str.startswith('G')]
  >>> g_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
                     ID                                           Sequence
  136  cds-CAD5125115.1  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
  147  cds-CAD5125114.1  GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA...
  138  cds-CAD5125115.1  GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
  149  cds-CAD5125114.1  GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG...
  12   cds-CAD5126492.1  GTCTAAAGGTTTTCGATACTTGTTTCAGTAAGTTTCATACATCAAA...
  13   cds-CAD5126492.1  GAATTCTTGGACCAACTACTAGTGCTATATCCGAAACTATATCCAC...
  42   cds-CAD5126496.1  GTTAAATGGTAAAAACTATGGTGGAAATCCTATACCTGAAAAAAGT...

We can use a regular expression as a filter. Let's get those with CC, followed by 1 to 3 characters, then AA:

  >>> pat_sg = sg[sg.Sequence.str.contains(r'CCC.{1,3}AA', regex=True)]
  >>> pat_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display
                    ID                                           Sequence
  14  cds-CAD5126492.1  TCTCTTGCTTCATTAATGGAAAAATTTGGCTGGAATTTTATTCTAA...
  43  cds-CAD5126496.1  TTCCAAATGATGATGCTACCCATAAAATATGTCTCTTTGAAACATT...
  44  cds-CAD5126496.1  TGTTTCTGTTGTTGATCCCACTAAGGCATCGGTAGATCTTACTGGC...

Let's check the mRNA sequence instead. Let's get the sequences with TTTT in their first 200 nucleotides:

  >>> z = mrna_seq[mrna_seq.str[:200].str.contains('TTTT')]
  >>> z # doctest: +NORMALIZE_WHITESPACE
  ID
  cds-CAD5125115.1    ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA...
  cds-CAD5126492.1    ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT...
  cds-CAD5126493.1    ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG...
  cds-CAD5126495.1    ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA...
  cds-CAD5126496.1    ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA...
  Name: Sequence, dtype: object

Now let's filter the original PyRanges object to get all interval groups with these IDs:

  >>> zsg = sg[sg.ID.isin(z.index)]
  >>> zsg
  index    |    Chromosome         Start    End      Strand      ID                ...
  int64    |    category           int64    int64    category    str               ...
  -------  ---  -----------------  -------  -------  ----------  ----------------  -----
  135      |    CAJFCJ010000025.1  2753     2755     -           cds-CAD5125115.1  ...
  136      |    CAJFCJ010000025.1  2593     2693     -           cds-CAD5125115.1  ...
  137      |    CAJFCJ010000025.1  2354     2537     -           cds-CAD5125115.1  ...
  138      |    CAJFCJ010000025.1  2174     2294     -           cds-CAD5125115.1  ...
  ...      |    ...                ...      ...      ...         ...               ...
  41       |    CAJFCJ010000053.1  77280    77395    +           cds-CAD5126496.1  ...
  42       |    CAJFCJ010000053.1  77458    77655    +           cds-CAD5126496.1  ...
  43       |    CAJFCJ010000053.1  77724    77839    +           cds-CAD5126496.1  ...
  44       |    CAJFCJ010000053.1  77899    78770    +           cds-CAD5126496.1  ...
  PyRanges with 15 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence").
  Contains 2 chromosomes and 2 strands.

Translation and reverse complement
----------------------------------

The submodule :mod:`pyranges1.seqs` contains functions to translate sequences and reverse complement them.
They can operate on a single sequence, or a Series of sequences as below:

  >>> sg['Protein'] = pr.seqs.translate(sg.Sequence)
  >>> sg['RevComp'] = pr.seqs.reverse_complement(sg.Sequence)
  >>> sg[ ['ID', 'Sequence', 'Protein', 'RevComp'] ].head(1).transpose()  # to allow display
                                                   145
  ID                                  cds-CAD5125114.1
  Sequence  ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA
  Protein                               MSRQSGRSNDPRKV
  RevComp   TACTTTTCGAGGATCATTTGATCTCCCACTCTGTCGCGACAT