pyranges.get_fasta

Module Contents

Functions

get_sequence(gr[, path, pyfaidx_fasta])

Get the sequence of the intervals from a fasta file

get_fasta(*args, **kwargs)

Deprecated: this function has been moved to Pyranges.get_sequence

get_transcript_sequence(gr, group_by[, path, ...])

Get the sequence of mRNAs, e.g. joining intervals corresponding to exons of the same transcript

pyranges.get_fasta.get_sequence(gr, path=None, pyfaidx_fasta=None)

Get the sequence of the intervals from a fasta file

Parameters:
  • gr (PyRanges) – Coordinates.

  • path (str) – Path to fasta file. It will be indexed using pyfaidx if an index is not found

  • pyfaidx_fasta (pyfaidx.Fasta) – Alternative method to provide fasta target, as a pyfaidx.Fasta object

Returns:

Sequences, one per interval.

Return type:

Series

Note

This function requires the library pyfaidx, it can be installed with conda install -c bioconda pyfaidx or pip install pyfaidx.

Sorting the PyRanges is likely to improve the speed. Intervals on the negative strand will be reverse complemented.

Warning

Note that the names in the fasta header and gr must be the same.

See also

get_transcript_sequence

obtain mRNA sequences, by joining exons belonging to the same transcript

Examples

>>> gr = pr.from_dict({"Chromosome": ["chr1", "chr1"],
...                    "Start": [5, 0], "End": [8, 5]})
>>> gr
+--------------+-----------+-----------+
| Chromosome   |     Start |       End |
| (category)   |   (int64) |   (int64) |
|--------------+-----------+-----------|
| chr1         |         5 |         8 |
| chr1         |         0 |         5 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 2 rows and 3 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
>>> tmp_handle = open("temp.fasta", "w+")
>>> _ = tmp_handle.write(">chr1\n")
>>> _ = tmp_handle.write("ATTACCAT\n")
>>> tmp_handle.close()
>>> seq = pr.get_sequence(gr, "temp.fasta")
>>> seq
0      CAT
1    ATTAC
dtype: object
>>> gr.seq = seq
>>> gr
+--------------+-----------+-----------+------------+
| Chromosome   |     Start |       End | seq        |
| (category)   |   (int64) |   (int64) | (object)   |
|--------------+-----------+-----------+------------|
| chr1         |         5 |         8 | CAT        |
| chr1         |         0 |         5 | ATTAC      |
+--------------+-----------+-----------+------------+
Unstranded PyRanges object has 2 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
pyranges.get_fasta.get_fasta(*args, **kwargs)

Deprecated: this function has been moved to Pyranges.get_sequence

pyranges.get_fasta.get_transcript_sequence(gr, group_by, path=None, pyfaidx_fasta=None)

Get the sequence of mRNAs, e.g. joining intervals corresponding to exons of the same transcript

Parameters:
  • gr (PyRanges) – Coordinates.

  • group_by (str or list of str) – intervals are grouped by this/these ID column(s): these are exons belonging to same transcript

  • path (str) – Path to fasta file. It will be indexed using pyfaidx if an index is not found

  • pyfaidx_fasta (pyfaidx.Fasta) – Alternative method to provide fasta target, as a pyfaidx.Fasta object

Returns:

Pandas DataFrame with a column for Sequence, plus ID column(s) provided with “group_by”

Return type:

DataFrame

Note

This function requires the library pyfaidx, it can be installed with conda install -c bioconda pyfaidx or pip install pyfaidx.

Sorting the PyRanges is likely to improve the speed. Intervals on the negative strand will be reverse complemented.

Warning

Note that the names in the fasta header and gr must be the same.

See also

get_sequence

obtain sequence of single intervals

Examples

>>> gr = pr.from_dict({"Chromosome": ['chr1', 'chr1', 'chr1'],
...                    "Start": [0, 9, 18], "End": [4, 13, 21],
...                    "Strand":['+', '-', '-'],
...                    "transcript": ['t1', 't2', 't2']})
>>> gr
+--------------+-----------+-----------+--------------+--------------+
| Chromosome   |     Start |       End | Strand       | transcript   |
| (category)   |   (int64) |   (int64) | (category)   | (object)     |
|--------------+-----------+-----------+--------------+--------------|
| chr1         |         0 |         4 | +            | t1           |
| chr1         |         9 |        13 | -            | t2           |
| chr1         |        18 |        21 | -            | t2           |
+--------------+-----------+-----------+--------------+--------------+
Stranded PyRanges object has 3 rows and 5 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> tmp_handle = open("temp.fasta", "w+")
>>> _ = tmp_handle.write(">chr1\n")
>>> _ = tmp_handle.write("AAACCCTTTGGGAAACCCTTTGGG\n")
>>> tmp_handle.close()
>>> seq = pr.get_transcript_sequence(gr, path="temp.fasta", group_by='transcript')
>>> seq
  transcript Sequence
0         t1     AAAC
1         t2  AAATCCC

To write to a file in fasta format: # with open(‘outfile.fasta’, ‘w’) as fw: # nchars=60 # for row in seq.itertuples(): # s = ‘n’.join([ row.Sequence[i:i+nchars] for i in range(0, len(row.Sequence), nchars)]) # fw.write(f’>{row.transcript}n{s}n’)