Seqs: sequences
Utilities to process nucleotide or protein sequences.
- pyranges1.ext.seqs.clear_kmer_memory() → None
Clear memory used by the translation cache.
- pyranges1.ext.seqs.reverse_complement(seq: str | Series | Iterable[str], *, is_rna: bool = False, check: bool = False) → str | Series
Reverse complement a DNA sequence.
The input sequence is in DNA alphabet, upper or lowercase (ATGCatgc). The case is maintained in output. Other characters are left unchanged.
- Parameters:
seq (str | Iterable[str]) – nucleotide sequence in DNA format (characters: ATGCatgc), or an iterable of such sequences, e.g. a Series
is_rna (bool) – use this to provide the input seq in RNA format instead (characters: AUGCaugc)
check (bool) – check if the input string contains only characters present in the translation table, raising an error if not
- Returns:
revcompseq – Reverse complement nucleotide sequence in DNA format (or RNA if is_rna was set to True). If an iterable was provided, a Series of reverse complement sequences is returned.
- Return type:
str | Series
Examples
>>> pr.seqs.reverse_complement("ATGAAATTTGGGTGA")
'TCACCCAAATTTCAT'
>>> pr.seqs.reverse_complement("AUGAAAUUUGGGUGA", is_rna=True)
'UCACCCAAAUUUCAU'
>>> pr.seqs.reverse_complement("aaaATCcccGGG")
'CCCgggGATttt'
>>> pr.seqs.reverse_complement("ATCWWWCCCTTT")
'AAAGGGWWWGAT'
>>> pr.seqs.reverse_complement("AUGAAATGGGTGA", check=True)
Traceback (most recent call last):
    ...
ValueError: One or more characters in the input string are not present in the translation table.
>>> some_seqs = ["ATGAAATTTGGGTGA", "AAAGAAATGGGTGACCCCC"]
>>> pr.seqs.reverse_complement(some_seqs)
0        TCACCCAAATTTCAT
1    GGGGGTCACCCATTTCTTT
dtype: object
If a Series is provided, the output Series preserves its index:
>>> some_seqs = pd.Series(["ATGAAATTTGGGTGA", "AAAGAAATGGGTGACCCCC"], index=["s1", "s2"])
>>> pr.seqs.reverse_complement(some_seqs)
s1        TCACCCAAATTTCAT
s2    GGGGGTCACCCATTTCTTT
dtype: object
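The behaviour described for reverse_complement (case is kept, characters outside the alphabet pass through unchanged) can be illustrated with a plain-Python sketch. This is not the pyranges1 implementation: `revcomp_sketch` and the two tables are hypothetical names, built on the standard `str.maketrans` and `str.translate`.

```python
# Hypothetical stand-in for illustration only; not the pyranges1 code.
DNA_TABLE = str.maketrans("ATGCatgc", "TACGtacg")
RNA_TABLE = str.maketrans("AUGCaugc", "UACGuacg")


def revcomp_sketch(seq: str, *, is_rna: bool = False) -> str:
    """Complement each base, then reverse; unmapped characters pass through."""
    table = RNA_TABLE if is_rna else DNA_TABLE
    return seq.translate(table)[::-1]


print(revcomp_sketch("aaaATCcccGGG"))   # mixed case is preserved
print(revcomp_sketch("ATCWWWCCCTTT"))   # 'W' is not in the table, so it is kept
```

Because the mapping is a fixed character table, a character such as W is simply copied to the output, which is consistent with the fourth example above.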
- pyranges1.ext.seqs.translate(seq: str | Series | Iterable[str], *, genetic_code: str | int | dict = '1', unknown: str = 'X', sanitize: bool = True, cache: int | bool = False) → str | Series
Translate a coding sequence into protein.
The input sequence is expected to be a DNA sequence in uppercase format (characters: ATGC). Incomplete codons at the end of the sequence, as well as non-canonical codons, result in the unknown character “X”.
- Parameters:
seq (str | Iterable[str]) – nucleotide sequence in uppercase DNA format (characters: ATGC), or an iterable of such sequences, e.g. a Series
genetic_code (int | str | dict) – int or string-converted NCBI index for genetic code (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi), or a dictionary with one key per codon and amino acids as values (remember to include the translation for gaps, '---':'-'), or a string-converted NCBI index with a ‘+U’ suffix to have UGA translated as selenocysteine (U character)
unknown (str | None) – codons that are not found in the genetic code table are translated as this character; if None, finding an unknown codon raises an exception instead
sanitize (bool) – Whether the input is converted to DNA uppercase (from lowercase and/or RNA) to ensure translation works. Set to False if you are sure sequences are already DNA uppercase, to speed up execution
cache (bool | int) – speeds up translation by caching the translations of multi-codon strings. The first call is slow because all results are precomputed; after that, it is ~3 times faster than non-caching translate. With cache=True, all 3-codon translations are cached (memory 10 MB, precompute ~230 ms). Provide an int to define how many codons to cache; this is approximately the speedup that will be obtained. Note: memory and precomputing time grow exponentially with the number of codons cached. You may use pyranges1.ext.seqs.clear_kmer_memory() to free memory.
- Returns:
pep – Protein sequence resulting from translation, with gaps as ‘-’ and unknown characters as ‘X’. If an iterable was provided, a Series of protein sequences is returned.
- Return type:
str | Series
Examples
>>> pr.seqs.translate("ATGAAATTTGGGTGA")
'MKFG*'
>>> pr.seqs.translate("ATGTTGCTGAA")
'MLLX'
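To make the handling of incomplete and non-canonical codons concrete, here is a minimal codon-by-codon sketch. It is an illustration under stated assumptions, not how pyranges1 implements translate: `translate_sketch` and `STANDARD_CODE` are hypothetical names, and the table is the standard genetic code (NCBI table 1) written in the usual TCAG codon order.

```python
import itertools

# Standard genetic code (NCBI table 1); codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(seq, code=STANDARD_CODE, unknown="X"):
    """Hypothetical helper: look up each codon, left to right.

    A trailing fragment shorter than 3 nt, or any codon missing from `code`,
    becomes `unknown`; with unknown=None a ValueError is raised instead.
    """
    peptide = []
    for pos in range(0, len(seq), 3):
        codon = seq[pos:pos + 3]
        aa = code.get(codon)
        if aa is None:
            if unknown is None:
                raise ValueError(f"cannot find codon {codon} (pos {pos}-{pos + 3})")
            aa = unknown
        peptide.append(aa)
    return "".join(peptide)


print(translate_sketch("ATGAAATTTGGGTGA"))
print(translate_sketch("ATGTTGCTGAA"))  # the trailing 'AA' is an incomplete codon
```

The sketch performs no sanitizing, so lowercase or RNA characters are treated as unknown codons here.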
Translate with the vertebrate mitochondrial genetic code:
>>> pr.seqs.translate("ATGAAATTTGGGTGA", genetic_code=2)
'MKFGW'
Translate with a custom genetic code (all codons starting with A are translated as A, the rest as Q):
>>> import itertools
>>> gc = {codon: 'A' if codon.startswith('A') else 'Q'
...       for codon in map(''.join, itertools.product("TCAG", repeat=3))}
>>> pr.seqs.translate("ATGAAATTTGGGTGA", genetic_code=gc)
'AAQQQ'
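A more realistic custom dictionary usually starts from an existing table and overrides a few entries. The sketch below builds the standard code (NCBI table 1), adds the gap codon mentioned in the genetic_code parameter description, and recodes TGA as selenocysteine. The name `custom_code` is hypothetical, and whether this dictionary gives exactly the same result as the ‘+U’ suffix is an assumption, not something the documentation states.

```python
import itertools

# Standard genetic code (NCBI table 1); codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

custom_code = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}
custom_code["---"] = "-"  # gap codon, so aligned sequences keep their gaps
custom_code["TGA"] = "U"  # assumption: hand-made selenocysteine recoding

print(len(custom_code))    # 64 codons plus the gap entry
print(custom_code["TGA"])
```

A dictionary like this can then be passed as genetic_code=custom_code, in the same way as gc in the example above.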
>>> pr.seqs.translate("AUGAAATTtGGGTGA", sanitize=False)
'XKXG*'
>>> pr.seqs.translate("AUGAAATTtGGGTGA", sanitize=False, unknown=None)
Traceback (most recent call last):
    ...
ValueError: translate ERROR cannot find codon AUG (pos 0-3) in genetic_code!
Output for list of sequences is a Series:
>>> pr.seqs.translate(['ACTGCATAA', 'ATGGGGTACTAG'])
0     TA*
1    MGY*
dtype: object
>>> pr.seqs.translate(['AAUUUtACTGCACTACGACTAGCTAC', 'ACACTGACTGACTATCTGATCGAC'], sanitize=True)
0    NFTALRLAX
1     TLTDYLID
dtype: object
When the input is a Series, the output is a Series with the same index:
>>> x = pd.Series(['ACTGCATAA', 'ATGGGGTACTAG'], index=['s1', 's2'])
>>> pr.seqs.translate(x)
s1     TA*
s2    MGY*
dtype: object
Caching makes sense when translating many sequences. For large enough data, translate with cache=True is 3x faster than non-caching translate:
>>> import random
>>> random.seed(42)
>>> many_seqs = ["".join(random.choices("ATGC", k=1000)) for _ in range(100000)]
>>> translated_seqs = pr.seqs.translate(many_seqs, cache=True)
>>> translated_seqs[0][:50]
'DRHKKEHVLTRLRGINVIGDRANYLS*CYESYRQGDS*PGRDTLPYHV*D'
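The idea behind multi-codon caching can be sketched in plain Python: precompute the peptide for every string of N codons, then translate the sequence N codons at a time with a single dictionary lookup per chunk. This is only an illustration of the general technique, not the pyranges1 implementation; `build_kmer_table` and `translate_cached` are hypothetical names. The table holds 64**N entries, which is why memory and precomputing time grow exponentially with N.

```python
import itertools

# Standard genetic code (NCBI table 1); codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO_ACIDS))


def build_kmer_table(n_codons):
    """Precompute the peptide for every string of n_codons codons."""
    return {
        "".join(codons): "".join(CODE[c] for c in codons)
        for codons in itertools.product(CODONS, repeat=n_codons)
    }


def translate_cached(seq, table, n_codons, unknown="X"):
    """Translate in chunks of n_codons codons, one table lookup per chunk."""
    step = 3 * n_codons
    out = []
    for pos in range(0, len(seq), step):
        chunk = seq[pos:pos + step]
        hit = table.get(chunk)
        if hit is None:
            # Short tail or non-canonical chunk: fall back to single codons.
            hit = "".join(
                CODE.get(chunk[i:i + 3], unknown) for i in range(0, len(chunk), 3)
            )
        out.append(hit)
    return "".join(out)


table = build_kmer_table(2)  # 64**2 entries
print(len(table))
print(translate_cached("ATGAAATTTGGGTGA", table, 2))
```

With N=2 the table is small enough to build instantly; each extra codon multiplies its size by 64, so larger values of N trade memory for fewer lookups per sequence.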