Orfs: Open Reading Frames
Utilities to work with open reading frames and coding sequences (e.g. CDS features in gene annotation files).
- class pyranges1.ext.orfs.Any(*args, **kwargs)
Special type indicating an unconstrained type.
Any is compatible with every type.
Any assumed to have all methods.
All values assumed to be instances of Any.
Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance checks.
- pyranges1.ext.orfs.arg_to_list(by: str | Iterable[str] | None) → list[str]
Convert a by or column_key argument to a list.
If by is a string, it will be converted to a list. If by is None, an empty list will be returned. If by is already a list, it will be returned as is.
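The three cases described above can be sketched in a few lines. This is a hypothetical re-implementation for illustration only (the name arg_to_list_sketch is not part of pyranges1), and it assumes that any other iterable is simply materialized into a list.

```python
from __future__ import annotations

from collections.abc import Iterable


def arg_to_list_sketch(by: str | Iterable[str] | None) -> list[str]:
    """Hypothetical helper mirroring the documented behaviour of arg_to_list."""
    if by is None:
        # None means "no columns": return an empty list.
        return []
    if isinstance(by, str):
        # A single column name is wrapped in a list.
        return [by]
    if isinstance(by, list):
        # A list is returned as is (same object).
        return by
    # Assumption: any other iterable (tuple, set, ...) is converted to a list.
    return list(by)


cols = ["transcript_id", "gene_id"]
print(arg_to_list_sketch(None))             # []
print(arg_to_list_sketch("transcript_id"))  # ['transcript_id']
print(arg_to_list_sketch(cols) is cols)     # True
```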
- pyranges1.ext.orfs.calculate_frame(p: pr.PyRanges, group_by: str | list[str], frame_col: str = 'Frame') → pr.PyRanges
Calculate the frame of genomic intervals, assuming all are coding sequences (CDS), and add it as a column.
After this, the PyRanges will contain an added “Frame” column (named by frame_col), which indicates the codon position occupied by the first nucleotide of each interval. Values range from 0 to 2 inclusive: 0 indicates that the first nucleotide of the interval is the first base of a codon, 1 indicates the second base, and 2 indicates the third base. The 5’-most interval of each transcript always has frame 0, while the following ones may have any of these values.
- Parameters:
p (PyRanges) – Input CDS intervals.
group_by (str or list of str) – Column(s) to group by the intervals: coding exons belonging to the same transcript have the same values in this/these column(s).
frame_col (str, default 'Frame') – Name of the column to store the frame values.
- Return type:
PyRanges
Examples
>>> import pyranges1 as pr
>>> p = pr.PyRanges({"Chromosome": [1,1,1,2,2],
...                  "Strand": ["+","+","+","-","-"],
...                  "Start": [1,31,52,101,201],
...                  "End": [10,45,90,130,218],
...                  "transcript_id": ["t1","t1","t1","t2","t2"]})
>>> p
  index  |      Chromosome  Strand      Start      End  transcript_id
  int64  |           int64  str         int64    int64  str
-------  ---  ------------  --------  -------  -------  ---------------
      0  |               1  +               1       10  t1
      1  |               1  +              31       45  t1
      2  |               1  +              52       90  t1
      3  |               2  -             101      130  t2
      4  |               2  -             201      218  t2
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
>>> pr.orfs.calculate_frame(p, group_by=['transcript_id'])
  index  |      Chromosome  Strand      Start      End  transcript_id      Frame
  int64  |           int64  str         int64    int64  str                int64
-------  ---  ------------  --------  -------  -------  ---------------  -------
      0  |               1  +               1       10  t1                     0
      1  |               1  +              31       45  t1                     0
      2  |               1  +              52       90  t1                     2
      3  |               2  -             101      130  t2                     2
      4  |               2  -             201      218  t2                     0
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 2 chromosomes and 2 strands.
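The Frame values in the example are consistent with a simple rule: order the intervals of each transcript from 5’ to 3’, then take the number of coding nucleotides that precede each interval, modulo 3. The sketch below applies that rule to plain dictionaries. It is an illustration under that assumption, not the pyranges1 implementation, and frames_sketch is a hypothetical name.

```python
from collections import defaultdict


def frames_sketch(intervals, group_key="transcript_id"):
    """Return one frame per interval, in input order (hypothetical helper)."""
    # Collect the row positions that belong to each transcript.
    groups = defaultdict(list)
    for idx, interval in enumerate(intervals):
        groups[interval[group_key]].append(idx)

    frames = [None] * len(intervals)
    for row_ids in groups.values():
        # 5'-to-3' order: ascending Start on '+', descending Start on '-'.
        minus = intervals[row_ids[0]]["Strand"] == "-"
        ordered = sorted(row_ids, key=lambda i: intervals[i]["Start"], reverse=minus)
        preceding = 0  # coding nucleotides upstream of the current interval
        for i in ordered:
            frames[i] = preceding % 3
            preceding += intervals[i]["End"] - intervals[i]["Start"]
    return frames


# Same coordinates as the example above.
cds = [
    {"Strand": "+", "Start": 1, "End": 10, "transcript_id": "t1"},
    {"Strand": "+", "Start": 31, "End": 45, "transcript_id": "t1"},
    {"Strand": "+", "Start": 52, "End": 90, "transcript_id": "t1"},
    {"Strand": "-", "Start": 101, "End": 130, "transcript_id": "t2"},
    {"Strand": "-", "Start": 201, "End": 218, "transcript_id": "t2"},
]
print(frames_sketch(cds))  # [0, 0, 2, 2, 0]
```

For t1, the second interval is preceded by 9 nucleotides (frame 0) and the third by 9 + 14 = 23 (frame 2). For t2, on the negative strand, the interval starting at 201 is the 5’-most one, so the interval starting at 101 is preceded by 17 nucleotides (frame 2).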
- pyranges1.ext.orfs.ensure_pyranges(df: pd.DataFrame) → PyRanges
Ensure df is a PyRanges.
Helps pyright.
- pyranges1.ext.orfs.extend_orfs(p: pr.PyRanges, fasta_path: str, group_by: str | list[str] | None = None, *, direction: Iterable[Literal['up', 'down']] | Literal['up', 'down'] | None = None, starts: list[str] = ['ATG'], stops: list[str] = ['TAG', 'TGA', 'TAA'], keep_off_bounds: bool = False, record_extensions: bool = False, chunk_size: int = 900, verbose: bool = False) → pr.PyRanges
Extend PyRanges intervals to form complete open reading frames.
The input intervals are extended downstream to their next stop codon, and upstream to the leftmost start codon found before encountering a stop.
- Parameters:
p (PyRanges) – Input CDS intervals.
fasta_path (str) – Location of the FASTA file from which the sequences for the extensions will be retrieved.
group_by (str or list of str or None) – Name(s) of column(s) used to group intervals into transcripts.
starts (list of str, default ['ATG']) – Nucleotide patterns to look for upstream. If an empty list is given, ORFs are delimited by stops only.
stops (list of str, default ['TAG', 'TGA', 'TAA']) – Nucleotide patterns to look for downstream.
direction ('up', 'down', an iterable of these, or None) – Whether the extension should be upstream ('up'), downstream ('down') or both. The default (None) means ['up', 'down'].
keep_off_bounds (bool, default False) – If True, intervals that reached out of bounds during extension without finding any stop are returned in their largest (3-nt multiple) extension. In this case, these intervals will not begin with a start or end with a stop.
record_extensions (bool, default False) – If True, add columns extension_up and extension_down with the extension amounts.
chunk_size (int, default 900) – Number of nucleotides to extend on each iteration. Does not affect the output, only speed and memory use.
verbose (bool, default False) – If True, print information about the progress of the extension.
Note
This function requires the library pyfaidx, which can be installed with conda install -c bioconda pyfaidx or pip install pyfaidx.
Sorting the PyRanges is likely to improve the speed. Intervals on the negative strand will be reverse complemented.
Examples
>>> import pyranges1 as pr
>>> p = pr.PyRanges({"Chromosome": ['seq1'], "Start":[20], "End":[29], "Strand" : ["+"]})
>>> p
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1               20       29  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> #           *       ^       ^       ... ... ...         *     # ... = p interval
>>> seq1 = " AA TAA TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG TAA GTG C".replace(' ', '')
>>> tmp_handle = open("temp.fasta", "w+")
>>> _ = tmp_handle.write(">seq1\n")
>>> _ = tmp_handle.write(seq1+'\n')
>>> tmp_handle.close()
>>> p.get_sequence("temp.fasta")
0    GCCGGGATT
Name: Sequence, dtype: object
>>> ep = pr.orfs.extend_orfs(p, fasta_path="temp.fasta")
>>> ep
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1                8       38  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> ep.get_sequence("temp.fasta")
0    ATGGTAATGGGCGCCGGGATTCCACAGTAA
Name: Sequence, dtype: object
>>> pr.orfs.extend_orfs(p, fasta_path="temp.fasta", record_extensions=True)
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       38  +                     12                 9
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
Extending only in one direction:
>>> pr.orfs.extend_orfs(p, fasta_path="temp.fasta", direction='up')
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1                8       29  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
With starts=[], any codon can be used as a start (i.e. ORFs defined as stop-delimited sequences):
>>> ep=pr.orfs.extend_orfs(p, fasta_path="temp.fasta", starts=[])
>>> ep
  index  |    Chromosome      Start      End  Strand
  int64  |    str             int64    int64  str
-------  ---  ------------  -------  -------  --------
      0  |    seq1                5       38  +
PyRanges with 1 rows, 4 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> ep.get_sequence("temp.fasta")
0    TGTATGGTAATGGGCGCCGGGATTCCACAGTAA
Name: Sequence, dtype: object
Example with multi-exon input intervals. Intervals on the negative strand are extended accordingly:
>>> #                   :   *           ^       ... .        ..     *     # ... = p interval
>>> # reverse complement: C TAG CGT TTG ATG TTG GGC CAG GTG TTT CAG TAG CCC GG
>>> seq2 = " CC GGG CTA CTG AAA CAC CTG GCC CAA CAT CAA ACG CTA G".replace(' ', '')
>>> tmp_handle = open("temp1.fasta", "w+")
>>> _ = tmp_handle.write(">seq2\n")
>>> _ = tmp_handle.write(seq2+'\n')
>>> tmp_handle.close()
>>> np = pr.PyRanges({"Chromosome": ['seq2']*2, "Start":[19, 11], "End":[23, 13],
...                   "Strand" : ["-"]*2, "ID":["a", "a"]})
>>> np
  index  |    Chromosome      Start      End  Strand    ID
  int64  |    str             int64    int64  str       str
-------  ---  ------------  -------  -------  --------  -----
      0  |    seq2               19       23  -         a
      1  |    seq2               11       13  -         a
PyRanges with 2 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> np.get_sequence("temp1.fasta")
0    GGCC
1      TT
Name: Sequence, dtype: object
>>> ep = pr.orfs.extend_orfs(np, fasta_path="temp1.fasta", group_by='ID')
>>> ep.get_sequence("temp1.fasta")
0    ATGTTGGGCC
1      TTCAGTAG
Name: Sequence, dtype: object
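To see why the two negative-strand intervals above yield GGCC and TT, it can help to reproduce the lookup by hand. The sketch below works on an in-memory string rather than a FASTA file and assumes 0-based, half-open coordinates, which is consistent with the sequences shown in these examples. The helpers reverse_complement and interval_sequence are hypothetical names, not pyranges1 functions.

```python
# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def interval_sequence(chrom_seq: str, start: int, end: int, strand: str) -> str:
    """Sequence of a 0-based half-open interval, read 5' to 3' on its strand."""
    sub = chrom_seq[start:end]
    return reverse_complement(sub) if strand == "-" else sub


seq2 = " CC GGG CTA CTG AAA CAC CTG GCC CAA CAT CAA ACG CTA G".replace(" ", "")

exon_a = interval_sequence(seq2, 19, 23, "-")
exon_b = interval_sequence(seq2, 11, 13, "-")
print(exon_a, exon_b)   # GGCC TT

# On the negative strand the interval with the higher Start is the 5'-most one,
# so the spliced coding sequence is exon_a followed by exon_b.
print(exon_a + exon_b)  # GGCCTT
```

The spliced sequence is six nucleotides long, that is two whole codons, which is why the extension continues in frame on both sides of the transcript.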
A sequence with no in-frame stop between the input interval and the end of the sequence:
>>> #            *       ^       ^       ... ... ...     # ... = p interval
>>> seq1b = " AA TAA TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG AAA GTG C".replace(' ', '')
>>> tmp_handle = open("temp2.fasta", "w+")
>>> _ = tmp_handle.write(">seq1\n")
>>> _ = tmp_handle.write(seq1b+'\n')
>>> tmp_handle.close()
>>> pr.orfs.extend_orfs(p, fasta_path="temp2.fasta", record_extensions=True)
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       29  +                     12                 0
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
Showcasing keep_off_bounds:
>>> ep=pr.orfs.extend_orfs(p, fasta_path="temp2.fasta", record_extensions=True, keep_off_bounds=True)
>>> ep
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       41  +                     12                12
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> ep.get_sequence("temp2.fasta")
0    ATGGTAATGGGCGCCGGGATTCCACAGAAAGTG
Name: Sequence, dtype: object
A sequence with no in-frame stops BEFORE the input interval:
>>> #                    ^       ^       ... ... ...         *     # ... = p interval
>>> seq1c = " AA TAC TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG TAA GTG C".replace(' ', '')
>>> tmp_handle = open("temp3.fasta", "w+")
>>> _ = tmp_handle.write(">seq1\n")
>>> _ = tmp_handle.write(seq1c+'\n')
>>> tmp_handle.close()
>>> pr.orfs.extend_orfs(p, fasta_path="temp3.fasta", record_extensions=True)
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                8       38  +                     12                 9
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> ep = pr.orfs.extend_orfs(p, fasta_path="temp3.fasta", record_extensions=True, keep_off_bounds=True)
>>> ep
  index  |    Chromosome      Start      End  Strand      extension_up    extension_down
  int64  |    str             int64    int64  str                int64             int64
-------  ---  ------------  -------  -------  --------  --------------  ----------------
      0  |    seq1                2       38  +                     18                 9
PyRanges with 1 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
>>> ep.get_sequence("temp3.fasta")
0    TACTGTATGGTAATGGGCGCCGGGATTCCACAGTAA
Name: Sequence, dtype: object
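The coordinates in the single-interval examples above can be reproduced with a codon-by-codon walk over the sequence. The sketch below is a simplified illustration, not the pyranges1 implementation: it handles one positive-strand interval on an in-memory string, with no FASTA access, grouping, or chunking. The name extend_orf_sketch is hypothetical, and the behaviour for cases the examples do not cover (marked in the comments) is an assumption.

```python
STOPS = ("TAG", "TGA", "TAA")


def extend_orf_sketch(seq, start, end, starts=("ATG",), stops=STOPS,
                      keep_off_bounds=False):
    """Extend a plus-strand, 0-based half-open interval; return (start, end)."""
    # Downstream: advance 3 nt at a time until an in-frame stop is included.
    pos, found_stop = end, False
    while pos + 3 <= len(seq):
        pos += 3
        if seq[pos - 3:pos] in stops:
            found_stop = True
            break
    # Without a stop, keep the extension only if keep_off_bounds is set.
    new_end = pos if (found_stop or keep_off_bounds) else end

    # Upstream: walk back until a stop, remembering the leftmost start seen.
    pos, found_stop, leftmost_start = start, False, None
    while pos - 3 >= 0:
        codon = seq[pos - 3:pos]
        if codon in stops:
            found_stop = True
            break
        pos -= 3
        if codon in starts:
            leftmost_start = pos
    if not found_stop and keep_off_bounds:
        new_start = pos  # largest in-frame extension, as in the temp3 example
    elif starts:
        new_start = leftmost_start if leftmost_start is not None else start
    elif found_stop:
        new_start = pos  # stop-delimited ORF: first codon after the stop
    else:
        new_start = start  # assumption: this case is not shown in the examples
    return new_start, new_end


seq1 = " AA TAA TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG TAA GTG C".replace(" ", "")
seq1b = " AA TAA TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG AAA GTG C".replace(" ", "")
seq1c = " AA TAC TGT ATG GTA ATG GGC GCC GGG ATT CCA CAG TAA GTG C".replace(" ", "")

print(extend_orf_sketch(seq1, 20, 29))                         # (8, 38)
print(extend_orf_sketch(seq1, 20, 29, starts=()))              # (5, 38)
print(extend_orf_sketch(seq1b, 20, 29))                        # (8, 29)
print(extend_orf_sketch(seq1b, 20, 29, keep_off_bounds=True))  # (8, 41)
print(extend_orf_sketch(seq1c, 20, 29, keep_off_bounds=True))  # (2, 38)
```

The extension_up and extension_down values reported with record_extensions=True follow from these coordinates: for the first call, 20 - 8 = 12 nucleotides upstream and 38 - 29 = 9 downstream.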