Seqs: sequences

Utilities to process nucleotide or protein sequences.

pyranges1.ext.seqs.clear_kmer_memory() None

Clear memory used by the translation cache.

pyranges1.ext.seqs.reverse_complement(seq: str | Series | Iterable[str], *, is_rna: bool = False, check: bool = False) str | Series

Reverse complement a DNA sequence.

The input sequence is in DNA alphabet, upper or lowercase (ATGCatgc). The case is maintained in output. Other characters are left unchanged.

Parameters:
  • seq (str | Iterable[str]) – nucleotide sequence in DNA format (characters: ATGCatgc), or an iterable of such sequences, e.g. a Series

  • is_rna (bool) – use this to provide the input seq in RNA format instead (characters: AUGCaugc)

  • check (bool) – check if the input string contains only characters present in the translation table, raising an error if not

Returns:

revcompseq – Reverse complement nucleotide sequence in DNA format (or RNA if is_rna was set to True). If an iterable was provided, a Series of reverse complement sequences is returned.

Return type:

str | Series

Examples

>>> pr.seqs.reverse_complement("ATGAAATTTGGGTGA")
'TCACCCAAATTTCAT'
>>> pr.seqs.reverse_complement("AUGAAAUUUGGGUGA", is_rna=True)
'UCACCCAAAUUUCAU'
>>> pr.seqs.reverse_complement("aaaATCcccGGG")
'CCCgggGATttt'
>>> pr.seqs.reverse_complement("ATCWWWCCCTTT")
'AAAGGGWWWGAT'
>>> pr.seqs.reverse_complement("AUGAAATGGGTGA", check=True)
Traceback (most recent call last):
...
ValueError: One or more characters in the input string are not present in the translation table.
>>> some_seqs=["ATGAAATTTGGGTGA", "AAAGAAATGGGTGACCCCC"]
>>> pr.seqs.reverse_complement(some_seqs)
0        TCACCCAAATTTCAT
1    GGGGGTCACCCATTTCTTT
dtype: object

If a Series is provided, the output Series preserves its index:

>>> some_seqs = pd.Series(["ATGAAATTTGGGTGA", "AAAGAAATGGGTGACCCCC"], index=["s1", "s2"])
>>> pr.seqs.reverse_complement(some_seqs)
s1        TCACCCAAATTTCAT
s2    GGGGGTCACCCATTTCTTT
dtype: object
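
The behaviour described above (complement each base, keep the case, leave other characters unchanged, then reverse) can be sketched in plain Python. This is an illustration only, not the pyranges1 implementation: reverse_complement_sketch and the two table names are hypothetical, and only single strings are handled.

```python
# Hypothetical helper names; a sketch, not the pyranges1 implementation.
DNA_TABLE = str.maketrans("ATGCatgc", "TACGtacg")
RNA_TABLE = str.maketrans("AUGCaugc", "UACGuacg")


def reverse_complement_sketch(seq, is_rna=False, check=False):
    """Reverse complement a single string, preserving case."""
    table = RNA_TABLE if is_rna else DNA_TABLE
    # str.maketrans returns a dict keyed by code point, so membership is cheap.
    if check and any(ord(ch) not in table for ch in seq):
        raise ValueError(
            "One or more characters in the input string are not present "
            "in the translation table."
        )
    # Characters missing from the table are left unchanged by str.translate.
    return seq.translate(table)[::-1]


print(reverse_complement_sketch("aaaATCcccGGG"))
print(reverse_complement_sketch("AUGAAAUUUGGGUGA", is_rna=True))
```

For single strings, the sketch gives the same results as the examples above, e.g. 'CCCgggGATttt' for "aaaATCcccGGG".
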
pyranges1.ext.seqs.translate(seq: str | Series | Iterable[str], *, genetic_code: str | int | dict = '1', unknown: str = 'X', sanitize: bool = True, cache: int | bool = False) str | Series

Translate a coding sequence into protein.

The input sequence is expected to be a DNA sequence in uppercase format (characters: ATGC). Incomplete codons at the end of the sequence, as well as non-canonical codons, result in the unknown character “X”.

Parameters:
  • seq (str | Iterable[str]) – nucleotide sequence in uppercase DNA format (characters: ATGC), or an iterable of such sequences, e.g. a Series

  • genetic_code (int | str | dict) – one of: an NCBI genetic code index, given as an int or in its string form (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi); a string NCBI index with a ‘+U’ suffix, to translate UGA as selenocysteine (U character); or a dictionary mapping each codon to an amino acid (remember to include the translation for gaps, '---':'-')

  • unknown (str | None) – codons that are not found in the genetic code table are translated as this character. If None, an unknown codon raises an exception instead

  • sanitize (bool) – Whether the input is converted to DNA uppercase (from lowercase and/or RNA) to ensure translation works. Set to False if you are sure sequences are already DNA uppercase, to speed up execution

  • cache (bool | int) – speeds up translation by caching the translations of multi-codon strings. The first call is slow because all results are precomputed; after that, translation is ~3 times faster than without caching. With cache=True, all 3-codon translations are cached (memory ~10 MB, precomputation ~230 ms). Provide an int to set how many codons to cache; this is approximately the speedup that will be obtained. Note: memory and precomputation time grow exponentially with the number of codons cached. You may use pyranges1.ext.seqs.clear_kmer_memory() to free memory.
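
The idea behind the cache parameter can be sketched as follows. This is a rough illustration under stated assumptions, not the library's actual code: build_kmer_cache and translate_with_cache are hypothetical names, and the standard genetic code (NCBI table 1) is hard-coded. Each extra cached codon multiplies the number of entries by 64, which is why memory and precomputation time grow exponentially.

```python
import itertools

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(map("".join, itertools.product(BASES, repeat=3)), AMINO_ACIDS))


def build_kmer_cache(n_codons):
    """Precompute the peptide for every string of n_codons canonical codons."""
    return {
        "".join(codons): "".join(CODON_TABLE[c] for c in codons)
        for codons in itertools.product(CODON_TABLE, repeat=n_codons)
    }


def translate_with_cache(seq, cache, n_codons, unknown="X"):
    """Translate seq in chunks of n_codons codons, one dict lookup per chunk."""
    step = 3 * n_codons
    peptide = []
    for start in range(0, len(seq), step):
        chunk = seq[start:start + step]
        hit = cache.get(chunk)
        if hit is None:
            # Short or non-canonical chunk: fall back to codon-by-codon lookup.
            hit = "".join(
                CODON_TABLE.get(chunk[i:i + 3], unknown)
                for i in range(0, len(chunk), 3)
            )
        peptide.append(hit)
    return "".join(peptide)


# The cache holds 64**n_codons entries.
cache = build_kmer_cache(2)
print(len(cache))
print(translate_with_cache("ATGAAATTTGGGTGA", cache, 2))
```

With 2 codons the sketch stores 64**2 = 4096 entries; with 3 codons, as cache=True is described above, it would store 64**3 = 262144.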

Returns:

pep – Protein sequence resulting from translation, with gaps as ‘-’ and unknown characters as ‘X’. If an iterable was provided, a Series of protein sequences is returned.

Return type:

str | Series

Examples

>>> pr.seqs.translate("ATGAAATTTGGGTGA")
'MKFG*'
>>> pr.seqs.translate("ATGTTGCTGAA")
'MLLX'
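
The rule at work in the two examples above (read the sequence three bases at a time; any codon that is incomplete or absent from the table becomes the unknown character) can be sketched in plain Python. This is an illustration only, not the pyranges1 implementation: translate_sketch is a hypothetical name, it handles single strings only, and it does no sanitizing.

```python
import itertools

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = dict(zip(map("".join, itertools.product(BASES, repeat=3)), AMINO_ACIDS))


def translate_sketch(seq, genetic_code=STANDARD_CODE, unknown="X"):
    """Translate one uppercase DNA string codon by codon."""
    peptide = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]  # the last slice may be shorter than 3
        amino_acid = genetic_code.get(codon)
        if amino_acid is None:
            if unknown is None:
                raise ValueError(f"cannot find codon {codon} in genetic_code")
            amino_acid = unknown
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate_sketch("ATGAAATTTGGGTGA"))
print(translate_sketch("ATGTTGCTGAA"))
```

The trailing "AA" in the second sequence is not a full codon, so the sketch, like the example above, ends the peptide with 'X'.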

Translate with the vertebrate mitochondrial genetic code:

>>> pr.seqs.translate("ATGAAATTTGGGTGA", genetic_code=2)
'MKFGW'

Translate with a custom genetic code (all codons starting with A are translated as A, the rest as Q):

>>> import itertools
>>> gc = {codon: 'A' if codon.startswith('A') else 'Q' for codon in map(''.join, itertools.product("TCAG", repeat=3))}
>>> pr.seqs.translate("ATGAAATTTGGGTGA", genetic_code=gc)
'AAQQQ'
>>> pr.seqs.translate("AUGAAATTtGGGTGA", sanitize=False)
'XKXG*'
>>> pr.seqs.translate("AUGAAATTtGGGTGA", sanitize=False, unknown=None)
Traceback (most recent call last):
...
ValueError: translate ERROR cannot find codon AUG (pos 0-3) in genetic_code!
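
A custom genetic code is often easier to derive from the standard table than to write from scratch. The sketch below builds such a dictionary and adds the gap entry mentioned in the genetic_code parameter description. The reassignment of TGA is a hypothetical change chosen only for illustration, and custom_code is an assumed variable name.

```python
import itertools

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = ["".join(c) for c in itertools.product(BASES, repeat=3)]

custom_code = dict(zip(codons, AMINO_ACIDS))
custom_code["TGA"] = "U"  # hypothetical change: read TGA as selenocysteine
custom_code["---"] = "-"  # translation for gaps, as the parameter notes ask

# Sanity check: every canonical codon still has a translation.
missing = [codon for codon in codons if codon not in custom_code]
print(len(custom_code), missing)
```

A dictionary built this way could then be passed as genetic_code=custom_code.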

The output for a list of sequences is a Series:

>>> pr.seqs.translate(['ACTGCATAA', 'ATGGGGTACTAG'])
0     TA*
1    MGY*
dtype: object
>>> pr.seqs.translate(['AAUUUtACTGCACTACGACTAGCTAC', 'ACACTGACTGACTATCTGATCGAC'], sanitize=True)
0    NFTALRLAX
1     TLTDYLID
dtype: object

When the input is a Series, the output is a Series with the same index:

>>> x = pd.Series(['ACTGCATAA', 'ATGGGGTACTAG'], index=['s1', 's2'])
>>> pr.seqs.translate(x)
s1     TA*
s2    MGY*
dtype: object

Caching pays off when translating many sequences. For large enough data, translate with cache=True is about 3 times faster than translate without caching:

>>> import random
>>> random.seed(42)
>>> many_seqs = [ "".join(random.choices("ATGC", k=1000)) for _ in range(100000) ]
>>> translated_seqs = pr.seqs.translate(many_seqs, cache=True)
>>> translated_seqs[0][:50]
'DRHKKEHVLTRLRGINVIGDRANYLS*CYESYRQGDS*PGRDTLPYHV*D'