Orfs: Open Reading Frames

Utilities to work with open reading frames and coding sequences (e.g. CDS features in gene annotation files).

class pyranges1.ext.orfs.Any(*args, **kwargs)

Special type indicating an unconstrained type.

Any is compatible with every type.
Any assumed to have all methods.
All values assumed to be instances of Any.

Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance checks.

pyranges1.ext.orfs.arg_to_list(by: str | Iterable[str] | None) → list[str]

Convert a by or column_key argument to a list.

If by is a string, it will be converted to a list. If by is None, an empty list will be returned. If by is already a list, it will be returned as is.
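
The conversion rules above can be sketched in a few lines. This is a hypothetical re-implementation written only from the description; the name `arg_to_list_sketch` is made up for illustration, and turning a non-list iterable into a list with `list()` is an assumption, since the text only covers strings, None, and lists.

```python
from typing import Iterable, Union


def arg_to_list_sketch(by: Union[str, Iterable[str], None]) -> list:
    """Normalize a `by`-style argument to a list (illustrative sketch only)."""
    if by is None:
        # No grouping columns requested.
        return []
    if isinstance(by, str):
        # A single column name becomes a one-element list.
        return [by]
    if isinstance(by, list):
        # Lists are returned unchanged (same object).
        return by
    # Assumption: any other iterable of names is materialized as a list.
    return list(by)


print(arg_to_list_sketch("transcript_id"))
print(arg_to_list_sketch(None))
print(arg_to_list_sketch(["gene_id", "transcript_id"]))
```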

pyranges1.ext.orfs.calculate_frame(p: pr.PyRanges, group_by: str | list[str], frame_col: str = 'Frame') → pr.PyRanges

Calculate the frame of genomic intervals, assuming all are coding sequences (CDS), and add it as column.

After this, the PyRanges will contain an added “Frame” column, which indicates the codon position of the first nucleotide of each interval. Values range from 0 to 2 inclusive: 0 indicates that the first nucleotide of the interval is the first base of a codon, 1 indicates the second base, and 2 indicates the third base. While the 5’-most interval of each transcript always has frame 0, the following ones may have any of these values.

Parameters:

p (PyRanges) – Input CDS intervals.
group_by (str or list of str) – Column(s) to group the intervals by: coding exons belonging to the same transcript have the same values in this/these column(s).
frame_col (str, default 'Frame') – Name of the column to store the frame values.

Return type:

PyRanges

Examples

>>> import pyranges1 as pr
>>> p = pr.PyRanges({"Chromosome": [1,1,1,2,2],
...                   "Strand": ["+","+","+","-","-"],
...                   "Start": [1,31,52,101,201],
...                   "End": [10,45,90,130,218],
...                   "transcript_id": ["t1","t1","t1","t2","t2"]})
>>> p
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       10  t1
      1  |               1  +              31       45  t1
      2  |               1  +              52       90  t1
      3  |               2  -             101      130  t2
      4  |               2  -             201      218  t2
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.

>>> pr.orfs.calculate_frame(p, group_by=['transcript_id'])
  index  |      Chromosome  Strand      Start      End  transcript_id      Frame
  int64  |           int64  str         int64    int64  str                int64
-------  ---  ------------  --------  -------  -------  ---------------  -------
      0  |               1  +               1       10  t1                     0
      1  |               1  +              31       45  t1                     0
      2  |               1  +              52       90  t1                     2
      3  |               2  -             101      130  t2                     2
      4  |               2  -             201      218  t2                     0
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
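
The frame values in this example can be reproduced by hand: walking each transcript from its 5' end, the frame of an interval is the total coding length that precedes it, modulo 3. The sketch below is a plain-Python illustration of that arithmetic, inferred from the description and the output shown here rather than taken from the library's source; `frames_for_transcript` is a hypothetical helper name.

```python
def frames_for_transcript(intervals, strand):
    """Return {(start, end): frame} for the CDS intervals of one transcript.

    Intervals are 0-based, half-open (start, end) tuples.
    """
    # Visit intervals in 5'->3' order: ascending on '+', descending on '-'.
    ordered = sorted(intervals, reverse=(strand == "-"))
    frames = {}
    coding_length_so_far = 0
    for start, end in ordered:
        # Position within the codon of the first nucleotide of this interval.
        frames[(start, end)] = coding_length_so_far % 3
        coding_length_so_far += end - start
    return frames


t1 = frames_for_transcript([(1, 10), (31, 45), (52, 90)], "+")
t2 = frames_for_transcript([(101, 130), (201, 218)], "-")
print(t1)  # {(1, 10): 0, (31, 45): 0, (52, 90): 2}
print(t2)  # {(201, 218): 0, (101, 130): 2}
```

For t2, which lies on the negative strand, the 5'-most interval is the one with the highest coordinates (201-218), so it receives frame 0.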

pyranges1.ext.orfs.ensure_pyranges(df: pd.DataFrame) → PyRanges

Ensure df is a PyRanges.

Helps pyright.

pyranges1.ext.orfs.extend_orfs(p: pr.PyRanges, fasta_path: str, group_by: str | list[str] | None = None, *, direction: Iterable[Literal['up', 'down']] | Literal['up', 'down'] | None = None, starts: list[str] = ['ATG'], stops: list[str] = ['TAG', 'TGA', 'TAA'], keep_off_bounds: bool = False, record_extensions: bool = False, chunk_size: int = 900, verbose: bool = False) → pr.PyRanges

Extend PyRanges intervals to form complete open reading frames.

The input intervals are extended downstream to their next stop codon, and upstream to the most upstream in-frame start codon found before encountering a stop.

Parameters:

p (PyRanges) – Input CDS intervals.
fasta_path (str) – Location of the Fasta file from which the sequences for the extensions will be retrieved.
group_by (str or list of str or None) – Name(s) of column(s) to group intervals into transcripts.
starts (list of str, default ['ATG']) – Nucleotide patterns to look for upstream. If an empty list is given, ORFs are delimited by stops only.
stops (list of str, default ['TAG', 'TGA', 'TAA']) – Nucleotide patterns to look for downstream.
direction ('up', 'down', an iterable of these, or None) – Whether the extension should be upstream ('up'), downstream ('down'), or both. The default (None) means ['up', 'down'].
keep_off_bounds (bool, default False) – If True, intervals that reached out of bounds during extension without finding any stop are returned in their largest (3-nt multiple) extension. In this case, these intervals will not begin with a start or end with a stop.
record_extensions (bool, default False) – If True, add columns extension_up and extension_down with the extension amounts.
chunk_size (int, default 900) – Number of nucleotides to extend by on each iteration. Does not affect the output, only speed and memory use.
verbose (bool, default False) – If True, print information about the progress of the extension.

Note

This function requires the library pyfaidx, which can be installed with conda install -c bioconda pyfaidx or pip install pyfaidx.

Sorting the PyRanges is likely to improve the speed. Intervals on the negative strand will be reverse complemented.

Examples

>>> import pyranges1 as pr
>>> p = pr.PyRanges({"Chromosome": ['seq1'], "Start":[20], "End":[29], "Strand" : ["+"]})
>>> p
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1               20       29  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> #            *       ^       ^      ... ... ...          *       #  ... = p interval
>>> seq1 = " AA TAA TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG TAA GTG C".replace(' ', '')
>>> tmp_handle = open("temp.fasta", "w+")
>>> _ = tmp_handle.write(">seq1\n")
>>> _ = tmp_handle.write(seq1+'\n')
>>> tmp_handle.close()

>>> p.get_sequence("temp.fasta")
0    GCCGGGATT
Name: Sequence, dtype: object

>>> ep = pr.orfs.extend_orfs(p, fasta_path="temp.fasta")
>>> ep
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1                8       38  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> ep.get_sequence("temp.fasta")
0    ATGGTAATGGGCGCCGGGATTCCACAGTAA
Name: Sequence, dtype: object

>>> pr.orfs.extend_orfs(p, fasta_path="temp.fasta", record_extensions=True)
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       38  +                     12                 9
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Extending only in one direction:

>>> pr.orfs.extend_orfs(p, fasta_path="temp.fasta", direction='up')
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1                8       29  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

With starts=[], any codon can be used as a start (i.e. ORFs defined as stop-delimited sequences):

>>> ep=pr.orfs.extend_orfs(p, fasta_path="temp.fasta", starts=[])
>>> ep
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1                5       38  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> ep.get_sequence("temp.fasta")
0    TGTATGGTAATGGGCGCCGGGATTCCACAGTAA
Name: Sequence, dtype: object
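
To make the extension rule concrete, here is a minimal sketch that works on a plain Python string for a single interval on the positive strand. It is not the library's implementation (which reads chunks from a Fasta file and handles grouping and the negative strand); `extend_orf` is a hypothetical helper whose behavior is inferred from the description and the single-interval examples in this section.

```python
def extend_orf(seq, start, end, starts=("ATG",), stops=("TAG", "TGA", "TAA")):
    """Extend the 0-based, half-open interval [start, end) on the + strand."""
    # Downstream: read codon by codon; stop at the first in-frame stop codon
    # and include it. If the sequence ends first, leave the end untouched.
    new_end = end
    pos = end
    while pos + 3 <= len(seq):
        codon = seq[pos:pos + 3]
        pos += 3
        if codon in stops:
            new_end = pos
            break

    # Upstream: read codon by codon until an in-frame stop (excluded),
    # remembering the most upstream start codon seen on the way.
    # With an empty `starts`, every non-stop codon counts as a start.
    new_start = start
    pos = start
    while pos - 3 >= 0:
        codon = seq[pos - 3:pos]
        if codon in stops:
            break
        pos -= 3
        if not starts or codon in starts:
            new_start = pos
    return new_start, new_end


seq1 = "AATAATGTATGGTAATGGGCGCCGGGATTCCACAGTAAGTGC"
print(extend_orf(seq1, 20, 29))             # (8, 38)
print(extend_orf(seq1, 20, 29, starts=()))  # (5, 38)
```

The two printed intervals match the default and starts=[] results above. The amounts reported by record_extensions correspond to the differences between old and new coordinates, here 20 - 8 = 12 upstream and 38 - 29 = 9 downstream.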

Example with multi-exon input intervals. Intervals on the negative strand are extended accordingly:

>>> #                   :    *           ^      ... .        ..      *      # ... = p interval
>>> # reverse complement: C TAG CGT TTG ATG TTG GGC CAG GTG TTT CAG TAG CCC GG
>>> seq2 = " CC GGG CTA CTG AAA CAC CTG GCC CAA CAT CAA ACG CTA G".replace(' ', '')
>>> tmp_handle = open("temp1.fasta", "w+")
>>> _ = tmp_handle.write(">seq2\n")
>>> _ = tmp_handle.write(seq2+'\n')
>>> tmp_handle.close()

>>> np = pr.PyRanges({"Chromosome": ['seq2']*2, "Start":[19, 11], "End":[23, 13],
...                   "Strand" : ["-"]*2, "ID":["a", "a"]})
>>> np
  index  |    Chromosome      Start      End  Strand    ID
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      0  |    seq2               19       23  -         a
      1  |    seq2               11       13  -         a
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> np.get_sequence("temp1.fasta")
0    GGCC
1      TT
Name: Sequence, dtype: object

>>> ep = pr.orfs.extend_orfs(np, fasta_path="temp1.fasta", group_by='ID')
>>> ep.get_sequence("temp1.fasta")
0    ATGTTGGGCC
1      TTCAGTAG
Name: Sequence, dtype: object
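
For readers who want to check the negative-strand sequences by hand, a reverse complement can be computed in plain Python as sketched below. `reverse_complement` is an illustrative helper, not part of the documented API, and it only handles the unambiguous bases A, C, G, T.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


seq2 = "CCGGGCTACTGAAACACCTGGCCCAACATCAAACGCTAG"
print(reverse_complement(seq2))
# Sequences of the two input intervals, read on the negative strand:
print(reverse_complement(seq2[19:23]))  # GGCC
print(reverse_complement(seq2[11:13]))  # TT
```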

A sequence with no in-frame stops between the input interval and the end of the sequence:

>>> #             *       ^       ^      ... ... ...                   #  ... = p interval
>>> seq1b = " AA TAA TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG AAA GTG C".replace(' ', '')
>>> tmp_handle = open("temp2.fasta", "w+")
>>> _ = tmp_handle.write(">seq1\n")
>>> _ = tmp_handle.write(seq1b+'\n')
>>> tmp_handle.close()

>>> pr.orfs.extend_orfs(p, fasta_path="temp2.fasta", record_extensions=True)
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       29  +                     12                 0
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

Showcasing keep_off_bounds:

>>> ep=pr.orfs.extend_orfs(p, fasta_path="temp2.fasta", record_extensions=True, keep_off_bounds=True)
>>> ep
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       41  +                     12                12
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> ep.get_sequence("temp2.fasta")
0    ATGGTAATGGGCGCCGGGATTCCACAGAAAGTG
Name: Sequence, dtype: object

A sequence with no in-frame stops BEFORE the input interval:

>>> #                     ^       ^      ... ... ...          *        #  ... = p interval
>>> seq1c = " AA TAC TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG TAA GTG C".replace(' ', '')
>>> tmp_handle = open("temp3.fasta", "w+")
>>> _ = tmp_handle.write(">seq1\n")
>>> _ = tmp_handle.write(seq1c+'\n')
>>> tmp_handle.close()

>>> pr.orfs.extend_orfs(p, fasta_path="temp3.fasta", record_extensions=True)
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       38  +                     12                 9
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> ep = pr.orfs.extend_orfs(p, fasta_path="temp3.fasta", record_extensions=True, keep_off_bounds=True)
>>> ep
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                2       38  +                     18                 9
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.

>>> ep.get_sequence("temp3.fasta")
0    TACTGTATGGTAATGGGCGCCGGGATTCCACAGTAA
Name: Sequence, dtype: object
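
The "largest (3-nt multiple) extension" returned with keep_off_bounds can be checked with simple integer arithmetic. The sketch below assumes a single interval on the positive strand and a known sequence length; `max_in_frame_extension` is a hypothetical helper, not a library function.

```python
def max_in_frame_extension(seq_len, start, end):
    """Largest multiples of 3 by which [start, end) can grow within [0, seq_len)."""
    up = (start // 3) * 3                # room before the interval, whole codons only
    down = ((seq_len - end) // 3) * 3    # room after the interval, whole codons only
    return up, down


# The example sequences in this section are 42 nt long; the input interval is 20-29.
up, down = max_in_frame_extension(42, 20, 29)
print(up, down)             # 18 12
print(20 - up, 29 + down)   # 2 41
```

These values agree with the keep_off_bounds examples in this section: an upstream extension of 18 gives Start 2 when no in-frame stop precedes the interval, and a downstream extension of 12 gives End 41 when none follows it.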