Mapping between coordinate systems
Mapping coordinate systems: cheatsheet
overlap.
Nested coordinate systems
PyRanges may represent nested coordinate systems, where the intervals in one coordinate system are relative to the intervals in another coordinate system. This is useful for representing hierarchical data.
Let’s make an intuitive case. On the one hand, we have some global ranges, that is genomic intervals wherein coordinates refer to positions along the genome:
>>> import pyranges1 as pr, pandas as pd
>>> gr = pr.example_data.ncbi_gff
>>> gre = (gr[gr.Feature=='exon']).get_with_loc_columns('Parent')
>>> gre
index | Chromosome Start End Strand Parent
int64 | category int64 int64 category str
------- --- ----------------- ------- ------- ---------- ---------------------
3 | CAJFCJ010000053.1 4882 5264 - rna-DGYR_LOCUS13733
7 | CAJFCJ010000053.1 10474 10958 + rna-DGYR_LOCUS13734
8 | CAJFCJ010000053.1 11028 11169 + rna-DGYR_LOCUS13734
9 | CAJFCJ010000053.1 11227 11400 + rna-DGYR_LOCUS13734
... | ... ... ... ... ...
141 | CAJFCJ010000025.1 2753 2851 - rna-DGYR_LOCUS12552-2
142 | CAJFCJ010000025.1 2593 2693 - rna-DGYR_LOCUS12552-2
143 | CAJFCJ010000025.1 2354 2537 - rna-DGYR_LOCUS12552-2
144 | CAJFCJ010000025.1 1909 2294 - rna-DGYR_LOCUS12552-2
PyRanges with 57 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
On the other hand, we have local data, i.e. mapped to the transcript sequences that are annotated in the global ranges.
Below, we load some example data that we obtained by running the Rfam database of RNA motifs against the
transcript sequences in the gre object, using the software Infernal:
>>> rh = pr.example_data.rfam_hits
>>> rh = rh[ ['target_name', 'seq_from', 'seq_to', 'strand', 'query_name', 'mdl_from', 'mdl_to'] ]
>>> rh.head()
target_name seq_from seq_to strand query_name mdl_from mdl_to
0 rna-DGYR_LOCUS12552 267 311 + GAIT 1 71
1 rna-DGYR_LOCUS12552-2 288 332 + GAIT 1 71
2 rna-DGYR_LOCUS13738 1641 1678 + REN-SRE 1 37
3 rna-DGYR_LOCUS14091 137 43 - IFN_gamma 1 169
4 rna-DGYR_LOCUS14091 547 616 + snoZ30 1 97
Above, the target_name column contains the transcript IDs, which are
the same as the ID column in the gre object. Then we have coordinates that define the alignment between
portions of those transcripts and one of the RNA motifs (query models) in the Rfam database.
Let’s convert the rh object to a PyRanges object, taking care of switching from 1-based to 0-based coordinates:
>>> rh = rh.rename(columns={'target_name':'Chromosome', 'seq_from':'Start', 'seq_to':'End', 'strand':'Strand'})
>>> rh = pr.PyRanges(rh)
>>> rh['Start'] -= 1 # convert to 0-based coordinates
>>> rh
index | Chromosome Start End Strand query_name mdl_from mdl_to
int64 | str int64 int64 str str int64 int64
------- --- --------------------- ------- ------- -------- ------------ ---------- --------
0 | rna-DGYR_LOCUS12552 266 311 + GAIT 1 71
1 | rna-DGYR_LOCUS12552-2 287 332 + GAIT 1 71
2 | rna-DGYR_LOCUS13738 1640 1678 + REN-SRE 1 37
3 | rna-DGYR_LOCUS14091 136 43 - IFN_gamma 1 169
... | ... ... ... ... ... ... ...
31 | rna-DGYR_LOCUS13737 549 600 + mir-3047 1 61
32 | rna-DGYR_LOCUS13737 543 605 + mir-3156 1 77
33 | rna-DGYR_LOCUS13737 529 612 + MIR1523 1 92
34 | rna-DGYR_LOCUS13737 549 600 + MIR8001 1 67
PyRanges with 35 rows, 7 columns, and 1 index columns.
Contains 13 chromosomes and 2 strands.
Invalid ranges:
* 15 intervals are empty or negative length (end <= start). See indexes: 3, 5, 6, ...
Infernal, like other programs, reports negative stranded hits with Start and End reversed. Let’s fix that:
>>> rh.loc[ rh.Strand == '-', ['Start', 'End'] ] = rh.loc[ rh.Strand == '-', ['End', 'Start'] ].values
>>> rh
index | Chromosome Start End Strand query_name mdl_from mdl_to
int64 | str int64 int64 str str int64 int64
------- --- --------------------- ------- ------- -------- ------------ ---------- --------
0 | rna-DGYR_LOCUS12552 266 311 + GAIT 1 71
1 | rna-DGYR_LOCUS12552-2 287 332 + GAIT 1 71
2 | rna-DGYR_LOCUS13738 1640 1678 + REN-SRE 1 37
3 | rna-DGYR_LOCUS14091 43 136 - IFN_gamma 1 169
... | ... ... ... ... ... ... ...
31 | rna-DGYR_LOCUS13737 549 600 + mir-3047 1 61
32 | rna-DGYR_LOCUS13737 543 605 + mir-3156 1 77
33 | rna-DGYR_LOCUS13737 529 612 + MIR1523 1 92
34 | rna-DGYR_LOCUS13737 549 600 + MIR8001 1 67
PyRanges with 35 rows, 7 columns, and 1 index columns.
Contains 13 chromosomes and 2 strands.
Now we have the gre and rh objects, that represent global and local coordinate systems, respectively.
Let’s check that all Chromosome values in rh matches the ID column in gre:
>>> bool( rh.Chromosome.isin(gre.Parent).all() )
True
Mapping from local to global ranges
Next, we want to take the Rfam hits in rh, which are relative (local) to the transcript sequences,
and remap them to the genome (global) coordinates. To do so, we make use of the information in the gre object,
which defines the coordinates of each transcript, often split in exons, relative to the genome.
For this operation, we use the map_to_global method.
Besides the two PyRanges objects, we also need to specify the columns in the global range
that contains the identifier used as the Chromosome in the local range, provided by the global_on argument.
The resulting PyRanges object, rhg, contains the Rfam hits remapped to the genome coordinates:
>>> rhg = rh.map_to_global(gre, global_on='Parent')
>>> rhg
index | Chromosome Start End Strand query_name mdl_from mdl_to
int64 | category int64 int64 str str int64 int64
------- --- ----------------- ------- ------- -------- ------------ ---------- --------
0 | CAJFCJ010000025.1 2598 2643 - GAIT 1 71
1 | CAJFCJ010000025.1 2598 2643 - GAIT 1 71
2 | CAJFCJ010000053.1 77544 77582 + REN-SRE 1 37
3 | CAJFCJ010000097.1 2291 2384 - IFN_gamma 1 169
... | ... ... ... ... ... ... ...
31 | CAJFCJ010000053.1 39708 39759 + mir-3047 1 61
32 | CAJFCJ010000053.1 39702 39764 + mir-3156 1 77
33 | CAJFCJ010000053.1 39688 39771 + MIR1523 1 92
34 | CAJFCJ010000053.1 39708 39759 + MIR8001 1 67
PyRanges with 38 rows, 7 columns, and 1 index columns (with 3 index duplicates).
Contains 3 chromosomes and 2 strands.
Note that the transcript identifiers are now missing. To keep them, we can use the keep_id argument.
Analogously, we can record the local coordinates by using the keep_loc argument:
>>> rh.map_to_global(gre, global_on='Parent', keep_id=True, keep_loc=True).drop(
... columns=['query_name', 'mdl_from','mdl_to']) # dropping some columns to allow display
index | Chromosome Start End Strand Parent Start_local End_local ...
int64 | category int64 int64 str str int64 int64 ...
------- --- ----------------- ------- ------- -------- --------------------- ------------- ----------- -----
0 | CAJFCJ010000025.1 2598 2643 - rna-DGYR_LOCUS12552 266 311 ...
1 | CAJFCJ010000025.1 2598 2643 - rna-DGYR_LOCUS12552-2 287 332 ...
2 | CAJFCJ010000053.1 77544 77582 + rna-DGYR_LOCUS13738 1640 1678 ...
3 | CAJFCJ010000097.1 2291 2384 - rna-DGYR_LOCUS14091 43 136 ...
... | ... ... ... ... ... ... ... ...
31 | CAJFCJ010000053.1 39708 39759 + rna-DGYR_LOCUS13737 549 600 ...
32 | CAJFCJ010000053.1 39702 39764 + rna-DGYR_LOCUS13737 543 605 ...
33 | CAJFCJ010000053.1 39688 39771 + rna-DGYR_LOCUS13737 529 612 ...
34 | CAJFCJ010000053.1 39708 39759 + rna-DGYR_LOCUS13737 549 600 ...
PyRanges with 38 rows, 8 columns, and 1 index columns (with 3 index duplicates). (1 columns not shown: "Strand_local").
Contains 3 chromosomes and 2 strands.
Let’s now map intervals relative to protein sequences to genome coordinates. First, we obtain the coding sequences (CDS) from the original GFF file, and translate them to protein sequences (see Working with sequences for details):
>>> grc = (gr[gr.Feature=='CDS']).get_with_loc_columns('ID')
>>> genome_file = pr.example_data.files['ncbi.fasta']
>>> cds_seq = grc.get_sequence(genome_file, group_by='ID').str.upper()
>>> pep_seq = pr.seqs.translate(cds_seq)
>>> pep_seq.head()
ID
cds-CAD5125114.1 MSRQSGRSNDPRKVSGELLTLTYGALVAQLVKDSESDDEVNKQLDK...
cds-CAD5125115.1 MGYNIGIRLIEDFLARSSIGKCKDLRETAEIISKNGFKMFLNITPI...
cds-CAD5126491.1 MAKNPEKMSATKKLETINRCMGHTKRGLENGCYTKGLIKIRCFTAE...
cds-CAD5126492.1 MKIFAIISIYFILSESCYFRNVEVEGDFYLATFLAFHTDEYCTGPI...
cds-CAD5126493.1 MNFYRNFFNLIFCIKVSSFSPIQDYISCQEALTKTEQDGSYSIKPR...
Name: Sequence, dtype: object
As an example of a positional feature mapped to protein sequences, let’s find all instances of the amino acid ‘K’ (lysine):
>>> aa='K'
>>> z = [(seq_id, i, aa) for seq_id, seq in pep_seq.items() for i, char in enumerate(seq) if char == aa]
>>> z = pd.DataFrame(z, columns=['ID', 'Start', 'AminoAcid'])
>>> z.head()
ID Start AminoAcid
0 cds-CAD5125114.1 12 K
1 cds-CAD5125114.1 31 K
2 cds-CAD5125114.1 41 K
3 cds-CAD5125114.1 45 K
4 cds-CAD5125114.1 66 K
- Let’s convert to a PyRanges object:
>>> aa_pos = pr.PyRanges(z.rename(columns={'ID':'Chromosome'}).assign(End=lambda df: df.Start + 1 )) >>> aa_pos index | Chromosome Start AminoAcid End int64 | str int64 str int64 ------- --- ---------------- ------- ----------- ------- 0 | cds-CAD5125114.1 12 K 13 1 | cds-CAD5125114.1 31 K 32 2 | cds-CAD5125114.1 41 K 42 3 | cds-CAD5125114.1 45 K 46 ... | ... ... ... ... 457 | cds-CAD5126878.1 334 K 335 458 | cds-CAD5126878.1 341 K 342 459 | cds-CAD5126878.1 342 K 343 460 | cds-CAD5126878.1 350 K 351 PyRanges with 461 rows, 4 columns, and 1 index columns. Contains 17 chromosomes.
Next, we have to convert protein-based positions to nucleotide-based positions, still relative to the CDS. Because they’re in pythonic 0-based coordinates, we just need to multiply by 3:
>>> cds_pos = aa_pos.copy()
>>> cds_pos['Start'] *= 3
>>> cds_pos['End'] *= 3
Now we’re ready to map these positions to the genome coordinates. Let’s also fetch their underlying nucleotide sequence:
>>> genome_pos = cds_pos.map_to_global(grc, global_on='ID', keep_id=True)
>>> genome_pos['Sequence'] = genome_pos.get_sequence(genome_file).str.upper()
>>> genome_pos
index | Chromosome Start AminoAcid End ID Strand Sequence
int64 | category int64 str int64 str category object
------- --- ----------------- ------- ----------- ------- ---------------- ---------- ----------
0 | CAJFCJ010000025.1 3114 K 3117 cds-CAD5125114.1 - AAA
1 | CAJFCJ010000025.1 2797 K 2800 cds-CAD5125114.1 - AAG
2 | CAJFCJ010000025.1 2767 K 2770 cds-CAD5125114.1 - AAA
3 | CAJFCJ010000025.1 2755 K 2758 cds-CAD5125114.1 - AAA
... | ... ... ... ... ... ... ...
457 | CAJFCJ010000097.1 53008 K 53011 cds-CAD5126878.1 + AAG
458 | CAJFCJ010000097.1 53341 K 53344 cds-CAD5126878.1 + AAA
459 | CAJFCJ010000097.1 53344 K 53347 cds-CAD5126878.1 + AAA
460 | CAJFCJ010000097.1 53368 K 53371 cds-CAD5126878.1 + AAA
PyRanges with 466 rows, 7 columns, and 1 index columns (with 5 index duplicates).
Contains 3 chromosomes and 2 strands.
Because protein mapping to genome coordinates is common, map_to_map_to_global
provides a shortcut for this operation, through its pep_to_cds argument, which effectively multiplies the local
coordinates by 3 before mapping to the global coordinates. So this is equivalent to the previous operation:
>>> aa_pos.map_to_global(grc, global_on='ID', keep_id=True, pep_to_cds=True)
index | Chromosome Start AminoAcid End ID Strand
int64 | category int64 str int64 str category
------- --- ----------------- ------- ----------- ------- ---------------- ----------
0 | CAJFCJ010000025.1 3114 K 3117 cds-CAD5125114.1 -
1 | CAJFCJ010000025.1 2797 K 2800 cds-CAD5125114.1 -
2 | CAJFCJ010000025.1 2767 K 2770 cds-CAD5125114.1 -
3 | CAJFCJ010000025.1 2755 K 2758 cds-CAD5125114.1 -
... | ... ... ... ... ... ...
457 | CAJFCJ010000097.1 53008 K 53011 cds-CAD5126878.1 +
458 | CAJFCJ010000097.1 53341 K 53344 cds-CAD5126878.1 +
459 | CAJFCJ010000097.1 53344 K 53347 cds-CAD5126878.1 +
460 | CAJFCJ010000097.1 53368 K 53371 cds-CAD5126878.1 +
PyRanges with 466 rows, 6 columns, and 1 index columns (with 5 index duplicates).
Contains 3 chromosomes and 2 strands.
In the genetic code, the codons for lysine are ‘AAA’ or ‘AAG’, which fits what we see. One last important observation: note the warning above about index duplicates. Let’s take a look at them:
>>> genome_pos[genome_pos.index.duplicated(keep=False)]
index | Chromosome Start AminoAcid End ID Strand Sequence
int64 | category int64 str int64 str category object
------- --- ----------------- ------- ----------- ------- ---------------- ---------- ----------
234 | CAJFCJ010000053.1 77393 K 77395 cds-CAD5126496.1 + AA
234 | CAJFCJ010000053.1 77458 K 77459 cds-CAD5126496.1 + G
282 | CAJFCJ010000053.1 89719 K 89721 cds-CAD5126498.1 - AA
282 | CAJFCJ010000053.1 89660 K 89661 cds-CAD5126498.1 - G
... | ... ... ... ... ... ... ...
422 | CAJFCJ010000097.1 52381 K 52382 cds-CAD5126877.1 + A
422 | CAJFCJ010000097.1 52446 K 52448 cds-CAD5126877.1 + AG
446 | CAJFCJ010000097.1 52381 K 52382 cds-CAD5126878.1 + A
446 | CAJFCJ010000097.1 52446 K 52448 cds-CAD5126878.1 + AG
PyRanges with 10 rows, 7 columns, and 1 index columns (with 5 index duplicates).
Contains 2 chromosomes and 2 strands.
This is because the codon for some amino acids are split between two exons. In more general terms, this is an effect of mapping local features to a global coordinate system: if entities (e.g. CDS) encompasses multiple non-contiguous intervals in the global coordinate system (e.g. exons), a certain local feature (e.g. amino acid) may also be mapped into split intervals in global coordinates. These are identified by the index duplicates in the PyRanges object, as above.
Mapping from global to local ranges
Another task is to map from global to local ranges, which is the opposite of the previous task. This is useful when we put together data generated at different levels. For example, we may have genomic features predicted using the full genome, and we want to see where they reside in transcripts. In this example, we will map the sequence ‘AATAAA’, which is a polyadenylation signal motif, in genome sequences, then map them to transcript coordinates.
>>> pattern='AATAAA'
Let’s remind ourselves of the gre object, which contains the transcript coordinates:
>>> gre
index | Chromosome Start End Strand Parent
int64 | category int64 int64 category str
------- --- ----------------- ------- ------- ---------- ---------------------
3 | CAJFCJ010000053.1 4882 5264 - rna-DGYR_LOCUS13733
7 | CAJFCJ010000053.1 10474 10958 + rna-DGYR_LOCUS13734
8 | CAJFCJ010000053.1 11028 11169 + rna-DGYR_LOCUS13734
9 | CAJFCJ010000053.1 11227 11400 + rna-DGYR_LOCUS13734
... | ... ... ... ... ...
141 | CAJFCJ010000025.1 2753 2851 - rna-DGYR_LOCUS12552-2
142 | CAJFCJ010000025.1 2593 2693 - rna-DGYR_LOCUS12552-2
143 | CAJFCJ010000025.1 2354 2537 - rna-DGYR_LOCUS12552-2
144 | CAJFCJ010000025.1 1909 2294 - rna-DGYR_LOCUS12552-2
PyRanges with 57 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
Let’s get the lengths of all these chromosomes from the genome:
>>> import pyfaidx
>>> pyf = pyfaidx.Fasta(genome_file)
>>> chromsizes = {chrom: len(pyf[chrom]) for chrom in gre.chromosomes}
>>> chromsizes
{'CAJFCJ010000025.1': 3418, 'CAJFCJ010000053.1': 109277, 'CAJFCJ010000097.1': 78757}
Let’s now create a PyRanges object than spans their whole sequence, with two intervals per chromosome: one for plus strand and one for minus strand; and let’s load their sequence in memory:
>>> q = [(chrom, 0, chromsizes[chrom], '+') for chrom in chromsizes] + [(chrom, 0, chromsizes[chrom], '-') for chrom in chromsizes]
>>> q
[('CAJFCJ010000025.1', 0, 3418, '+'), ('CAJFCJ010000053.1', 0, 109277, '+'), ('CAJFCJ010000097.1', 0, 78757, '+'), ('CAJFCJ010000025.1', 0, 3418, '-'), ('CAJFCJ010000053.1', 0, 109277, '-'), ('CAJFCJ010000097.1', 0, 78757, '-')]
>>> full_seq = pr.PyRanges( pd.DataFrame(q, columns=['Chromosome', 'Start', 'End', 'Strand']))
>>> full_seq['Sequence'] = full_seq.get_sequence(genome_file).str.upper()
>>> matches = []
>>> for i in full_seq.itertuples():
... st=0
... while (pos_match:= i.Sequence.find(pattern, st)) != -1:
... pat_st, pat_end = pos_match, pos_match+len(pattern)
... if i.Strand == '-':
... pat_st, pat_end = i.End - (pos_match + len(pattern)), i.End - pos_match
... matches.append( (i.Chromosome, pat_st, pat_end, i.Strand) )
... st=pos_match+1
>>> matches = pr.PyRanges( pd.DataFrame(matches, columns=['Chromosome', 'Start', 'End', 'Strand']))
>>> matches['Sequence'] = matches.get_sequence(genome_file).str.upper() # fetch seq as control
>>> matches
index | Chromosome Start End Strand Sequence
int64 | str int64 int64 str object
------- --- ----------------- ------- ------- -------- ----------
0 | CAJFCJ010000025.1 81 87 + AATAAA
1 | CAJFCJ010000025.1 90 96 + AATAAA
2 | CAJFCJ010000025.1 347 353 + AATAAA
3 | CAJFCJ010000025.1 553 559 + AATAAA
... | ... ... ... ... ...
619 | CAJFCJ010000097.1 2513 2519 - AATAAA
620 | CAJFCJ010000097.1 2284 2290 - AATAAA
621 | CAJFCJ010000097.1 2163 2169 - AATAAA
622 | CAJFCJ010000097.1 858 864 - AATAAA
PyRanges with 623 rows, 5 columns, and 1 index columns.
Contains 3 chromosomes and 2 strands.
The object matches contains the positions of the motif in the genome.
Now, we want to map these positions to the transcript coordinates. Function
map_to_local does exactly this.
For each motif, we need to decide which transcript we want to use as reference
coordinate system. By default, it will use all transcripts that overlap the motif:
>>> gre_matches = matches.map_to_local(gre, ref_on='Parent')
>>> gre_matches
index | Chromosome Start End Strand Sequence
int64 | str int64 int64 str object
------- --- ------------------- ------- ------- -------- ----------
28 | rna-DGYR_LOCUS13734 1142 1148 + AATAAA
29 | rna-DGYR_LOCUS13734 1416 1422 + AATAAA
30 | rna-DGYR_LOCUS13734 2106 2112 + AATAAA
31 | rna-DGYR_LOCUS13734 2232 2238 + AATAAA
... | ... ... ... ... ...
617 | rna-DGYR_LOCUS14091 980 986 - AATAAA
618 | rna-DGYR_LOCUS14091 774 780 - AATAAA
619 | rna-DGYR_LOCUS14091 265 271 - AATAAA
620 | rna-DGYR_LOCUS14091 36 42 - AATAAA
PyRanges with 59 rows, 5 columns, and 1 index columns (with 2 index duplicates).
Contains 14 chromosomes and 2 strands.
åAs before, you have options to retain metadata, in this case from the global ranges:
>>> matches.map_to_local(gre, ref_on='Parent', keep_chrom=True, keep_loc=True).columns
Index(['Chromosome', 'Start', 'End', 'Strand', 'Sequence', 'Chromosome_global',
'Start_global', 'End_global', 'Strand_global'],
dtype='str')
Let’s look at another use case. For example, we way have CDS and exon coordinates, and we want to map the CDS coordinates to the exon coordinates. In other words, we want to know where CDS intervals reside along the full mRNA transcripts.
Let’s inspect some data inside the gr object, focusing on a single identifier:
>>> pr.options.set_option('max_rows_to_show', 20)
>>> some_id = 'rna-DGYR_LOCUS12552-2'
>>> gr[gr.Parent == some_id].get_with_loc_columns(['Feature', 'Parent'])
index | Chromosome Start End Strand Feature Parent
int64 | category int64 int64 category category str
------- --- ----------------- ------- ------- ---------- ---------- ---------------------
140 | CAJFCJ010000025.1 3111 3250 - exon rna-DGYR_LOCUS12552-2
141 | CAJFCJ010000025.1 2753 2851 - exon rna-DGYR_LOCUS12552-2
142 | CAJFCJ010000025.1 2593 2693 - exon rna-DGYR_LOCUS12552-2
143 | CAJFCJ010000025.1 2354 2537 - exon rna-DGYR_LOCUS12552-2
144 | CAJFCJ010000025.1 1909 2294 - exon rna-DGYR_LOCUS12552-2
145 | CAJFCJ010000025.1 3111 3153 - CDS rna-DGYR_LOCUS12552-2
146 | CAJFCJ010000025.1 2753 2851 - CDS rna-DGYR_LOCUS12552-2
147 | CAJFCJ010000025.1 2593 2693 - CDS rna-DGYR_LOCUS12552-2
148 | CAJFCJ010000025.1 2354 2537 - CDS rna-DGYR_LOCUS12552-2
149 | CAJFCJ010000025.1 2174 2294 - CDS rna-DGYR_LOCUS12552-2
PyRanges with 10 rows, 6 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
You can see that CDS and exon belonging to the same transcript have the same Parent value.
Therefore, we can easily map the CDS coordinates to the exon coordinates of the corresponding transcript,
by pairing (overlapping) CDS and exon intervals with the same Parent value:
>>> pr.options.reset_options()
>>> cds = gr[gr.Feature == 'CDS'].get_with_loc_columns(['Feature', 'Parent'])
>>> exons = gr[gr.Feature == 'exon'].get_with_loc_columns(['Feature', 'Parent'])
>>> cds_local = cds.map_to_local(exons, ref_on='Parent', match_by='Parent')
>>> cds_local
index | Chromosome Start End Strand Feature
int64 | str int64 int64 str category
------- --- --------------------- ------- ------- -------- ----------
0 | rna-DGYR_LOCUS13733 1 382 + CDS
1 | rna-DGYR_LOCUS13734 258 484 + CDS
2 | rna-DGYR_LOCUS13734 484 625 + CDS
3 | rna-DGYR_LOCUS13734 625 798 + CDS
... | ... ... ... ... ...
52 | rna-DGYR_LOCUS12552-2 139 237 + CDS
53 | rna-DGYR_LOCUS12552-2 237 337 + CDS
54 | rna-DGYR_LOCUS12552-2 337 520 + CDS
55 | rna-DGYR_LOCUS12552-2 520 640 + CDS
PyRanges with 56 rows, 5 columns, and 1 index columns.
Contains 17 chromosomes and 1 strands.
>>> cds_local[cds_local.Chromosome == some_id]
index | Chromosome Start End Strand Feature
int64 | str int64 int64 str category
------- --- --------------------- ------- ------- -------- ----------
51 | rna-DGYR_LOCUS12552-2 97 139 + CDS
52 | rna-DGYR_LOCUS12552-2 139 237 + CDS
53 | rna-DGYR_LOCUS12552-2 237 337 + CDS
54 | rna-DGYR_LOCUS12552-2 337 520 + CDS
55 | rna-DGYR_LOCUS12552-2 520 640 + CDS
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 1 chromosomes and 1 strands.
With the similar logic, we can easily map the start and stop codon positions to the exon coordinates:
>>> cds.slice_ranges(0, 3).assign(Feature='start').map_to_local(exons, ref_on='Parent', match_by='Parent')
index | Chromosome Start End Strand Feature
int64 | str int64 int64 str str
------- --- --------------------- ------- ------- -------- ---------
0 | rna-DGYR_LOCUS13733 1 4 + start
1 | rna-DGYR_LOCUS13734 258 261 + start
2 | rna-DGYR_LOCUS13734 484 487 + start
3 | rna-DGYR_LOCUS13734 625 628 + start
... | ... ... ... ... ...
52 | rna-DGYR_LOCUS12552-2 139 142 + start
53 | rna-DGYR_LOCUS12552-2 237 240 + start
54 | rna-DGYR_LOCUS12552-2 337 340 + start
55 | rna-DGYR_LOCUS12552-2 520 523 + start
PyRanges with 56 rows, 5 columns, and 1 index columns.
Contains 17 chromosomes and 1 strands.
>>> cds.slice_ranges(-3).assign(Feature='stop').map_to_local(exons, ref_on='Parent', match_by='Parent')
index | Chromosome Start End Strand Feature
int64 | str int64 int64 str str
------- --- --------------------- ------- ------- -------- ---------
0 | rna-DGYR_LOCUS13733 379 382 + stop
1 | rna-DGYR_LOCUS13734 481 484 + stop
2 | rna-DGYR_LOCUS13734 622 625 + stop
3 | rna-DGYR_LOCUS13734 795 798 + stop
... | ... ... ... ... ...
52 | rna-DGYR_LOCUS12552-2 234 237 + stop
53 | rna-DGYR_LOCUS12552-2 334 337 + stop
54 | rna-DGYR_LOCUS12552-2 517 520 + stop
55 | rna-DGYR_LOCUS12552-2 637 640 + stop
PyRanges with 56 rows, 5 columns, and 1 index columns.
Contains 17 chromosomes and 1 strands.