:py:mod:`pyranges.get_fasta`
============================

.. py:module:: pyranges.get_fasta


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   pyranges.get_fasta.get_sequence
   pyranges.get_fasta.get_fasta
   pyranges.get_fasta.get_transcript_sequence


.. py:function:: get_sequence(gr, path=None, pyfaidx_fasta=None)

   Get the sequence of each interval from a fasta file.

   :param gr: Coordinates.
   :type gr: PyRanges
   :param path: Path to the fasta file. It will be indexed using pyfaidx if an index is not found.
   :type path: str
   :param pyfaidx_fasta: Alternative way to provide the fasta target, as a ``pyfaidx.Fasta`` object.
   :type pyfaidx_fasta: pyfaidx.Fasta

   :returns: Sequences, one per interval.
   :rtype: Series

   .. note::

      This function requires the library pyfaidx, which can be installed with
      ``conda install -c bioconda pyfaidx`` or ``pip install pyfaidx``.

      Sorting the PyRanges is likely to improve the speed.
      Intervals on the negative strand will be reverse complemented.

   .. warning:: The sequence names in the fasta headers must match the chromosome names in ``gr``.
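
   A mismatch such as ``chr1`` in ``gr`` versus ``1`` in the fasta file is easy
   to overlook. The following is a minimal sketch, using only the standard
   library, of a check that could be run beforehand. The helpers
   ``fasta_names`` and ``missing_names`` are hypothetical and not part of
   pyranges, and the sketch assumes that the sequence name is the first
   whitespace-delimited token of each header line.

   .. code-block:: python

      import os
      import tempfile


      def fasta_names(path):
          """Return the set of sequence names found in the fasta headers."""
          names = set()
          with open(path) as handle:
              for line in handle:
                  if line.startswith(">"):
                      # Assumption: the name is the first token after ">".
                      fields = line[1:].split()
                      if fields:
                          names.add(fields[0])
          return names


      def missing_names(chromosomes, path):
          """Return the chromosome names that have no matching fasta header."""
          return sorted(set(chromosomes) - fasta_names(path))


      # Demonstration on a throwaway fasta file.
      with tempfile.TemporaryDirectory() as tmp_dir:
          demo_path = os.path.join(tmp_dir, "demo.fasta")
          with open(demo_path, "w") as handle:
              handle.write(">chr1 example header\nATTACCAT\n")
          ok = missing_names(["chr1", "chr1"], demo_path)
          bad = missing_names(["chr1", "1"], demo_path)

      print(ok)   # []
      print(bad)  # ['1']

   With a PyRanges object, the names to check would come from its
   ``Chromosome`` column.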

   .. seealso::

      :obj:`get_transcript_sequence`
          obtain mRNA sequences, by joining exons belonging to the same transcript

   .. rubric:: Examples

   >>> gr = pr.from_dict({"Chromosome": ["chr1", "chr1"],
   ...                    "Start": [5, 0], "End": [8, 5]})

   >>> gr
   +--------------+-----------+-----------+
   | Chromosome   |     Start |       End |
   | (category)   |   (int64) |   (int64) |
   |--------------+-----------+-----------|
   | chr1         |         5 |         8 |
   | chr1         |         0 |         5 |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 2 rows and 3 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.

   >>> tmp_handle = open("temp.fasta", "w+")
   >>> _ = tmp_handle.write(">chr1\n")
   >>> _ = tmp_handle.write("ATTACCAT\n")
   >>> tmp_handle.close()

   >>> seq = pr.get_sequence(gr, "temp.fasta")

   >>> seq
   0      CAT
   1    ATTAC
   dtype: object

   >>> gr.seq = seq
   >>> gr
   +--------------+-----------+-----------+------------+
   | Chromosome   |     Start |       End | seq        |
   | (category)   |   (int64) |   (int64) | (object)   |
   |--------------+-----------+-----------+------------|
   | chr1         |         5 |         8 | CAT        |
   | chr1         |         0 |         5 | ATTAC      |
   +--------------+-----------+-----------+------------+
   Unstranded PyRanges object has 2 rows and 4 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.
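
   The example above is unstranded. For stranded input, the note above states
   that intervals on the negative strand are reverse complemented. Below is a
   minimal pure-Python sketch of that behaviour, assuming 0-based, half-open
   coordinates as in the example. ``reverse_complement`` and
   ``interval_sequence`` are hypothetical helpers for illustration, not the
   pyranges implementation, and they only handle the bases A, C, G, T and N.

   .. code-block:: python

      # Translation table used to complement each base.
      _COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


      def reverse_complement(sequence):
          """Complement every base, then reverse the string."""
          return sequence.translate(_COMPLEMENT)[::-1]


      def interval_sequence(chrom_seq, start, end, strand="+"):
          """Slice [start, end) and reverse complement it on the minus strand."""
          sub = chrom_seq[start:end]
          return reverse_complement(sub) if strand == "-" else sub


      chrom = "ATTACCAT"
      plus = interval_sequence(chrom, 5, 8, "+")
      minus = interval_sequence(chrom, 5, 8, "-")
      print(plus)   # CAT
      print(minus)  # ATG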


.. py:function:: get_fasta(*args, **kwargs)

   Deprecated: this function has been moved to ``pyranges.get_sequence``.


.. py:function:: get_transcript_sequence(gr, group_by, path=None, pyfaidx_fasta=None)

   Get the sequence of mRNAs, e.g. by joining intervals corresponding to exons of the same transcript.

   :param gr: Coordinates.
   :type gr: PyRanges
   :param group_by: Intervals are grouped by this ID column (or these ID columns); each group holds the exons belonging to the same transcript.
   :type group_by: str or list of str
   :param path: Path to the fasta file. It will be indexed using pyfaidx if an index is not found.
   :type path: str
   :param pyfaidx_fasta: Alternative way to provide the fasta target, as a ``pyfaidx.Fasta`` object.
   :type pyfaidx_fasta: pyfaidx.Fasta

   :returns: pandas DataFrame with a ``Sequence`` column, plus the ID column(s) given in ``group_by``.
   :rtype: DataFrame

   .. note::

      This function requires the library pyfaidx, which can be installed with
      ``conda install -c bioconda pyfaidx`` or ``pip install pyfaidx``.

      Sorting the PyRanges is likely to improve the speed.
      Intervals on the negative strand will be reverse complemented.

   .. warning:: The sequence names in the fasta headers must match the chromosome names in ``gr``.

   .. seealso::

      :obj:`get_sequence`
          obtain sequence of single intervals

   .. rubric:: Examples

   >>> gr = pr.from_dict({"Chromosome": ['chr1', 'chr1', 'chr1'],
   ...                    "Start": [0, 9, 18], "End": [4, 13, 21],
   ...                    "Strand":['+', '-', '-'],
   ...                    "transcript": ['t1', 't2', 't2']})
   >>> gr
   +--------------+-----------+-----------+--------------+--------------+
   | Chromosome   |     Start |       End | Strand       | transcript   |
   | (category)   |   (int64) |   (int64) | (category)   | (object)     |
   |--------------+-----------+-----------+--------------+--------------|
   | chr1         |         0 |         4 | +            | t1           |
   | chr1         |         9 |        13 | -            | t2           |
   | chr1         |        18 |        21 | -            | t2           |
   +--------------+-----------+-----------+--------------+--------------+
   Stranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.

   >>> tmp_handle = open("temp.fasta", "w+")
   >>> _ = tmp_handle.write(">chr1\n")
   >>> _ = tmp_handle.write("AAACCCTTTGGGAAACCCTTTGGG\n")
   >>> tmp_handle.close()

   >>> seq = pr.get_transcript_sequence(gr, path="temp.fasta", group_by='transcript')
   >>> seq
     transcript Sequence
   0         t1     AAAC
   1         t2  AAATCCC
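
   To see where ``AAATCCC`` for ``t2`` comes from, the sketch below recomputes
   both sequences by hand in pure Python. It assumes that the exons of a
   negative-strand transcript are joined from the highest coordinate to the
   lowest, each one reverse complemented, which is consistent with the output
   above. ``transcript_sequence`` is a hypothetical helper, not the pyranges
   implementation.

   .. code-block:: python

      # Translation table used to complement each base.
      _COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


      def transcript_sequence(chrom_seq, exons, strand):
          """Join exon sequences in transcript order.

          exons is a list of (start, end) pairs, 0-based and half-open.
          """
          if strand == "-":
              # Minus strand: start from the last exon, reverse complement each.
              ordered = sorted(exons, reverse=True)
              parts = [chrom_seq[s:e].translate(_COMPLEMENT)[::-1]
                       for s, e in ordered]
          else:
              parts = [chrom_seq[s:e] for s, e in sorted(exons)]
          return "".join(parts)


      chrom = "AAACCCTTTGGGAAACCCTTTGGG"
      t1 = transcript_sequence(chrom, [(0, 4)], "+")
      t2 = transcript_sequence(chrom, [(9, 13), (18, 21)], "-")
      print(t1)  # AAAC
      print(t2)  # AAATCCC

   Here ``chrom[18:21]`` is ``TTT`` (reverse complement ``AAA``) and
   ``chrom[9:13]`` is ``GGGA`` (reverse complement ``TCCC``).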

   To write to a file in fasta format::

      with open('outfile.fasta', 'w') as fw:
          nchars = 60
          for row in seq.itertuples():
              s = '\n'.join([row.Sequence[i:i+nchars] for i in range(0, len(row.Sequence), nchars)])
              fw.write(f'>{row.transcript}\n{s}\n')