Working with sequences ~~~~~~~~~~~~~~~~~~~~~~ .. contents:: :local: :depth: 2 Fetching sequences per interval ------------------------------- A common operation is to fetch the sequences corresponding to the intervals represented in the PyRanges object. Let's see an example with built-in data. >>> import pyranges1 as pr >>> genome_file = pr.example_data.files['ncbi.fasta'] >>> sg = pr.example_data.ncbi_gff >>> sg = sg[sg.Feature == 'CDS'].get_with_loc_columns('ID') >>> sg = sg.sort_ranges().head(20).copy() >>> sg index | Chromosome Start End Strand ID int64 | category int64 int64 category str ------- --- ----------------- ------- ------- ---------- ---------------- 145 | CAJFCJ010000025.1 3111 3153 - cds-CAD5125114.1 146 | CAJFCJ010000025.1 2753 2851 - cds-CAD5125114.1 135 | CAJFCJ010000025.1 2753 2755 - cds-CAD5125115.1 136 | CAJFCJ010000025.1 2593 2693 - cds-CAD5125115.1 ... | ... ... ... ... ... 41 | CAJFCJ010000053.1 77280 77395 + cds-CAD5126496.1 42 | CAJFCJ010000053.1 77458 77655 + cds-CAD5126496.1 43 | CAJFCJ010000053.1 77724 77839 + cds-CAD5126496.1 44 | CAJFCJ010000053.1 77899 78770 + cds-CAD5126496.1 PyRanges with 20 rows, 5 columns, and 1 index columns. Contains 2 chromosomes and 2 strands. Function :func:`get_sequence ` takes the path to a fasta file, and returns a Series containing those sequences, with the same index as the PyRanges object. It requires the package pyfaidx (see :doc:`add-ons installation <./installation>`. :func:`get_sequence ` will treat each interval independently. The Series returned is typically used to create a new column in the PyRanges object: >>> sg['Sequence'] = sg.get_sequence(genome_file) >>> sg index | Chromosome Start End Strand ID ... int64 | category int64 int64 category str ... ------- --- ----------------- ------- ------- ---------- ---------------- ----- 145 | CAJFCJ010000025.1 3111 3153 - cds-CAD5125114.1 ... 146 | CAJFCJ010000025.1 2753 2851 - cds-CAD5125114.1 ... 135 | CAJFCJ010000025.1 2753 2755 - cds-CAD5125115.1 ... 136 | CAJFCJ010000025.1 2593 2693 - cds-CAD5125115.1 ... ... | ... ... ... ... ... ... 41 | CAJFCJ010000053.1 77280 77395 + cds-CAD5126496.1 ... 42 | CAJFCJ010000053.1 77458 77655 + cds-CAD5126496.1 ... 43 | CAJFCJ010000053.1 77724 77839 + cds-CAD5126496.1 ... 44 | CAJFCJ010000053.1 77899 78770 + cds-CAD5126496.1 ... PyRanges with 20 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence"). Contains 2 chromosomes and 2 strands. >>> sg.Sequence.head() 145 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA 146 AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC... 135 at 136 ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA... 147 ggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA... For intervals on the negative strand, the sequence has been reverse complemented. The ``Sequence`` Series is amenable to string operations, using the pandas ``.str`` operator. Let's convert it to uppercase: >>> sg['Sequence'] = sg.Sequence.str.upper() >>> sg.Sequence.head() 145 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA 146 AGTGGGGAATTATTGACCCTTACCTACGGAGCTTTGGTTGCTCAGC... 135 AT 136 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA... 147 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA... Fetching sequences per mRNA --------------------------- Often we're interested in the sequence of a transcript, i.e. the concatenation of the exons. Using the argument ``group_by`` in function :func:`get_sequence ` allows to get just that, by concatenating the sequences of the intervals in the PyRanges object in the correct order: >>> mrna_seq = sg.get_sequence(genome_file, group_by='ID') >>> mrna_seq # doctest: +NORMALIZE_WHITESPACE ID cds-CAD5125114.1 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG... cds-CAD5125115.1 atggGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA... cds-CAD5126492.1 ATGAAGATTTTTGctataatatcaatatattttatattatctgAGT... cds-CAD5126493.1 ATGAAtttctatagaaatttttttaatttaattttttgtattaaag... cds-CAD5126495.1 ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA... cds-CAD5126496.1 atggtTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA... Name: Sequence, dtype: object Again, let's convert to uppercase: >>> mrna_seq = mrna_seq.str.upper() >>> mrna_seq # doctest: +NORMALIZE_WHITESPACE ID cds-CAD5125114.1 ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTAAGTG... cds-CAD5125115.1 ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA... cds-CAD5126492.1 ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT... cds-CAD5126493.1 ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG... cds-CAD5126495.1 ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA... cds-CAD5126496.1 ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA... Name: Sequence, dtype: object Filtering by sequence --------------------- The ``Sequence`` column can be used to filter the PyRanges object. For example, let's get intervals whose sequence starts with a G: >>> g_sg= sg[sg.Sequence.str.startswith('G')] >>> g_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display ID Sequence 136 cds-CAD5125115.1 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA... 147 cds-CAD5125114.1 GGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCAGA... 138 cds-CAD5125115.1 GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG... 149 cds-CAD5125114.1 GTTCAGTTGGAAGTTGAAACGAAAATTGTTCAGGATCAATTAAAAG... 12 cds-CAD5126492.1 GTCTAAAGGTTTTCGATACTTGTTTCAGTAAGTTTCATACATCAAA... 13 cds-CAD5126492.1 GAATTCTTGGACCAACTACTAGTGCTATATCCGAAACTATATCCAC... 42 cds-CAD5126496.1 GTTAAATGGTAAAAACTATGGTGGAAATCCTATACCTGAAAAAAGT... We can use a regular expression as a filter. Let's get those with CC, followed by 1 to 3 characters, then AA: >>> pat_sg = sg[sg.Sequence.str.contains(r'CCC.{1,3}AA', regex=True)] >>> pat_sg[ ['ID', 'Sequence'] ] # show only ID and Sequence to allow display ID Sequence 14 cds-CAD5126492.1 TCTCTTGCTTCATTAATGGAAAAATTTGGCTGGAATTTTATTCTAA... 43 cds-CAD5126496.1 TTCCAAATGATGATGCTACCCATAAAATATGTCTCTTTGAAACATT... 44 cds-CAD5126496.1 TGTTTCTGTTGTTGATCCCACTAAGGCATCGGTAGATCTTACTGGC... Let's check the mRNA sequence instead. Let's get the sequences with TTTT in their first 200 nucleotides: >>> z = mrna_seq[mrna_seq.str[:200].str.contains('TTTT')] >>> z # doctest: +NORMALIZE_WHITESPACE ID cds-CAD5125115.1 ATGGGCTATAATATAGGAATACGTTTAATAGAAGACTTTCTAGCCA... cds-CAD5126492.1 ATGAAGATTTTTGCTATAATATCAATATATTTTATATTATCTGAGT... cds-CAD5126493.1 ATGAATTTCTATAGAAATTTTTTTAATTTAATTTTTTGTATTAAAG... cds-CAD5126495.1 ATGGATTTAGACTTTTTTGCTATATGGTCACCAACTGTAAATTTGA... cds-CAD5126496.1 ATGGTTATATGCATATTAAGAACTACAGACAACAAAATGAGAATAA... Name: Sequence, dtype: object Now let's filter the original PyRanges object to get all interval groups with these IDs: >>> zsg = sg[sg.ID.isin(z.index)] >>> zsg index | Chromosome Start End Strand ID ... int64 | category int64 int64 category str ... ------- --- ----------------- ------- ------- ---------- ---------------- ----- 135 | CAJFCJ010000025.1 2753 2755 - cds-CAD5125115.1 ... 136 | CAJFCJ010000025.1 2593 2693 - cds-CAD5125115.1 ... 137 | CAJFCJ010000025.1 2354 2537 - cds-CAD5125115.1 ... 138 | CAJFCJ010000025.1 2174 2294 - cds-CAD5125115.1 ... ... | ... ... ... ... ... ... 41 | CAJFCJ010000053.1 77280 77395 + cds-CAD5126496.1 ... 42 | CAJFCJ010000053.1 77458 77655 + cds-CAD5126496.1 ... 43 | CAJFCJ010000053.1 77724 77839 + cds-CAD5126496.1 ... 44 | CAJFCJ010000053.1 77899 78770 + cds-CAD5126496.1 ... PyRanges with 15 rows, 6 columns, and 1 index columns. (1 columns not shown: "Sequence"). Contains 2 chromosomes and 2 strands. Translation and reverse complement ---------------------------------- The submodule :mod:`pyranges1.seqs` contains functions to translate sequences and reverse complement them. They can operate on a single sequence, or a Series of sequences as below: >>> sg['Protein'] = pr.seqs.translate(sg.Sequence) >>> sg['RevComp'] = pr.seqs.reverse_complement(sg.Sequence) >>> sg[ ['ID', 'Sequence', 'Protein', 'RevComp'] ].head(1).transpose() # to allow display 145 ID cds-CAD5125114.1 Sequence ATGTCGCGACAGAGTGGGAGATCAAATGATCCTCGAAAAGTA Protein MSRQSGRSNDPRKV RevComp TACTTTTCGAGGATCATTTGATCTCCCACTCTGTCGCGACAT