:py:mod:`pyranges.get_fasta`
============================

.. py:module:: pyranges.get_fasta


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   pyranges.get_fasta.get_sequence
   pyranges.get_fasta.get_fasta
   pyranges.get_fasta.get_transcript_sequence


.. py:function:: get_sequence(gr, path=None, pyfaidx_fasta=None)

   Get the sequence of the intervals from a fasta file.

   :param gr: Coordinates.
   :type gr: PyRanges
   :param path: Path to fasta file. It will be indexed using pyfaidx if an index is not found.
   :type path: str
   :param pyfaidx_fasta: Alternative way to provide the fasta target, as a pyfaidx.Fasta object.
   :type pyfaidx_fasta: pyfaidx.Fasta

   :returns: Sequences, one per interval.
   :rtype: Series

   .. note::

      This function requires the pyfaidx library, which can be installed with
      ``conda install -c bioconda pyfaidx`` or ``pip install pyfaidx``.

      Sorting the PyRanges is likely to improve the speed.
      Intervals on the negative strand will be reverse complemented.

   .. warning::

      The sequence names in the fasta headers must match the Chromosome values in gr.

   .. seealso::

      :obj:`get_transcript_sequence`
          obtain mRNA sequences by joining exons belonging to the same transcript

   .. rubric:: Examples

   >>> import pyranges as pr
   >>> gr = pr.from_dict({"Chromosome": ["chr1", "chr1"],
   ...                    "Start": [5, 0], "End": [8, 5]})
   >>> gr
   +--------------+-----------+-----------+
   | Chromosome   |     Start |       End |
   | (category)   |   (int64) |   (int64) |
   |--------------+-----------+-----------|
   | chr1         |         5 |         8 |
   | chr1         |         0 |         5 |
   +--------------+-----------+-----------+
   Unstranded PyRanges object has 2 rows and 3 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.

   >>> tmp_handle = open("temp.fasta", "w+")
   >>> _ = tmp_handle.write(">chr1\n")
   >>> _ = tmp_handle.write("ATTACCAT\n")
   >>> tmp_handle.close()

   >>> seq = pr.get_sequence(gr, "temp.fasta")
   >>> seq
   0      CAT
   1    ATTAC
   dtype: object

   >>> gr.seq = seq
   >>> gr
   +--------------+-----------+-----------+------------+
   | Chromosome   |     Start |       End | seq        |
   | (category)   |   (int64) |   (int64) | (object)   |
   |--------------+-----------+-----------+------------|
   | chr1         |         5 |         8 | CAT        |
   | chr1         |         0 |         5 | ATTAC      |
   +--------------+-----------+-----------+------------+
   Unstranded PyRanges object has 2 rows and 4 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome.


.. py:function:: get_fasta(*args, **kwargs)

   Deprecated: this function has been moved to ``pyranges.get_sequence``.


.. py:function:: get_transcript_sequence(gr, group_by, path=None, pyfaidx_fasta=None)

   Get the sequence of mRNAs, e.g. by joining intervals corresponding to exons of the same transcript.

   :param gr: Coordinates.
   :type gr: PyRanges
   :param group_by: Intervals are grouped by this ID column (or these ID columns); each group holds the exons belonging to the same transcript.
   :type group_by: str or list of str
   :param path: Path to fasta file. It will be indexed using pyfaidx if an index is not found.
   :type path: str
   :param pyfaidx_fasta: Alternative way to provide the fasta target, as a pyfaidx.Fasta object.
   :type pyfaidx_fasta: pyfaidx.Fasta

   :returns: Pandas DataFrame with a Sequence column, plus the ID column(s) provided with ``group_by``.
   :rtype: DataFrame

   .. note::

      This function requires the pyfaidx library, which can be installed with
      ``conda install -c bioconda pyfaidx`` or ``pip install pyfaidx``.

      Sorting the PyRanges is likely to improve the speed.
      Intervals on the negative strand will be reverse complemented.

   .. warning::

      The sequence names in the fasta headers must match the Chromosome values in gr.

   .. seealso::

      :obj:`get_sequence`
          obtain the sequence of single intervals

   .. rubric:: Examples

   >>> import pyranges as pr
   >>> gr = pr.from_dict({"Chromosome": ['chr1', 'chr1', 'chr1'],
   ...                    "Start": [0, 9, 18], "End": [4, 13, 21],
   ...                    "Strand": ['+', '-', '-'],
   ...                    "transcript": ['t1', 't2', 't2']})
   >>> gr
   +--------------+-----------+-----------+--------------+--------------+
   | Chromosome   |     Start |       End | Strand       | transcript   |
   | (category)   |   (int64) |   (int64) | (category)   | (object)     |
   |--------------+-----------+-----------+--------------+--------------|
   | chr1         |         0 |         4 | +            | t1           |
   | chr1         |         9 |        13 | -            | t2           |
   | chr1         |        18 |        21 | -            | t2           |
   +--------------+-----------+-----------+--------------+--------------+
   Stranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.

   >>> tmp_handle = open("temp.fasta", "w+")
   >>> _ = tmp_handle.write(">chr1\n")
   >>> _ = tmp_handle.write("AAACCCTTTGGGAAACCCTTTGGG\n")
   >>> tmp_handle.close()

   >>> seq = pr.get_transcript_sequence(gr, path="temp.fasta", group_by='transcript')
   >>> seq
     transcript Sequence
   0         t1     AAAC
   1         t2  AAATCCC

   To write to a file in fasta format::

      with open('outfile.fasta', 'w') as fw:
          nchars = 60  # maximum number of sequence characters per line
          for row in seq.itertuples():
              s = '\n'.join([row.Sequence[i:i+nchars] for i in range(0, len(row.Sequence), nchars)])
              fw.write(f'>{row.transcript}\n{s}\n')
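
The ``t2`` result in the example above can be checked by hand. The following is a minimal pure-Python sketch of the joining and reverse-complementing logic, assuming 0-based, end-exclusive coordinates as in the example. The helper names ``reverse_complement`` and ``transcript_sequence`` are hypothetical and exist only for illustration; this is not the pyranges implementation and it does not use pyfaidx.

```python
# Illustrative only: hypothetical helpers, not part of pyranges.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def transcript_sequence(genome, exons, strand):
    """Join exon sequences in transcript (5' to 3') order.

    genome: the chromosome sequence as a string
    exons:  list of (start, end) pairs, 0-based and end-exclusive
    strand: '+' or '-'
    """
    ordered = sorted(exons)  # ascending genomic order
    joined = "".join(genome[start:end] for start, end in ordered)
    # On the minus strand, reverse-complementing the joined sequence is
    # equivalent to reverse-complementing each exon and joining the exons
    # in descending genomic order.
    return reverse_complement(joined) if strand == "-" else joined


chr1 = "AAACCCTTTGGGAAACCCTTTGGG"
t1 = transcript_sequence(chr1, [(0, 4)], "+")
t2 = transcript_sequence(chr1, [(9, 13), (18, 21)], "-")
print(t1)  # AAAC
print(t2)  # AAATCCC
```

For ``t2``, the two exons read ``GGGA`` and ``TTT`` on the forward strand; joined they give ``GGGATTT``, whose reverse complement is ``AAATCCC``, matching the Sequence column shown in the example.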