Column operations
Fetching or writing a column
Most column operations are analogous to pandas. A single PyRanges column (which is a pandas Series) can be read using dot notation:
>>> import pyranges1 as pr
>>> gr = pr.example_data.chipseq
>>> gr
index | Chromosome Start End Name Score Strand
int64 | category int64 int64 str int64 category
------- --- ------------ --------- --------- ------ ------- ----------
0 | chr8 28510032 28510057 U0 0 -
1 | chr7 107153363 107153388 U0 0 -
2 | chr5 135821802 135821827 U0 0 -
3 | chr14 19418999 19419024 U0 0 -
... | ... ... ... ... ... ...
16 | chr9 120803448 120803473 U0 0 +
17 | chr6 89296757 89296782 U0 0 -
18 | chr1 194245558 194245583 U0 0 +
19 | chr8 57916061 57916086 U0 0 +
PyRanges with 20 rows, 6 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.
>>> gr.Chromosome.head()
0 chr8
1 chr7
2 chr5
3 chr14
4 chr12
Name: Chromosome, dtype: category
Categories (15, str): ['chr1', 'chr10', 'chr11', 'chr12', ..., 'chr6', 'chr7', 'chr8', 'chr9']
>>> ( (gr.End + gr.Start)/2 ).head()
0 28510044.5
1 107153375.5
2 135821814.5
3 19419011.5
4 106679773.5
dtype: float64
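Because an extracted column is a pandas Series, ordinary Series arithmetic applies to it. The sketch below uses a plain pandas DataFrame as a stand-in for the first two intervals shown above (it is not a PyRanges object, and the variable names are illustrative). It contrasts true division, which yields floats as in the output above, with floor division, which keeps integer coordinates.

```python
import pandas as pd

# Plain pandas stand-in for the first two rows of the example data.
df = pd.DataFrame(
    {"Start": [28510032, 107153363], "End": [28510057, 107153388]}
)

# True division returns float64 values, matching the example above.
midpoint_float = (df["End"] + df["Start"]) / 2

# Floor division keeps an integer dtype, rounding the midpoint down.
midpoint_int = (df["End"] + df["Start"]) // 2
```

Which form is preferable depends on whether downstream steps expect integer positions.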
The gr[column_name] syntax also extracts a column from a PyRanges object:
>>> gr['Chromosome'].head()
0 chr8
1 chr7
2 chr5
3 chr14
4 chr12
Name: Chromosome, dtype: category
Categories (15, str): ['chr1', 'chr10', 'chr11', 'chr12', ..., 'chr6', 'chr7', 'chr8', 'chr9']
Only the gr[column_name] syntax is accepted for assignment (i.e., to create or edit a column):
>>> gr['newchr'] = gr['Chromosome'].str.replace('chr', '')
>>> gr
index | Chromosome Start End Name Score Strand newchr
int64 | category int64 int64 str int64 category str
------- --- ------------ --------- --------- ------ ------- ---------- --------
0 | chr8 28510032 28510057 U0 0 - 8
1 | chr7 107153363 107153388 U0 0 - 7
2 | chr5 135821802 135821827 U0 0 - 5
3 | chr14 19418999 19419024 U0 0 - 14
... | ... ... ... ... ... ... ...
16 | chr9 120803448 120803473 U0 0 + 9
17 | chr6 89296757 89296782 U0 0 - 6
18 | chr1 194245558 194245583 U0 0 + 1
19 | chr8 57916061 57916086 U0 0 + 8
PyRanges with 20 rows, 7 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.
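One reason to prefer brackets can be seen in plain pandas, where assigning to a new name through dot notation sets a Python attribute and does not create a column. The sketch below demonstrates this with an ordinary pandas DataFrame only; it makes no claim about what PyRanges does internally, and the variable names are illustrative.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"Chromosome": ["chr8", "chr7"]})

# Dot notation with a new name: pandas warns and stores a plain attribute,
# so the table itself is unchanged. The warning is silenced here for brevity.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.newchr = df["Chromosome"].str.replace("chr", "")
columns_after_dot = list(df.columns)

# Bracket notation with the same name: a real column is created.
df["newchr"] = df["Chromosome"].str.replace("chr", "")
columns_after_brackets = list(df.columns)
```

After the first assignment the column list still holds only Chromosome; after the second it also holds newchr.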
Extracting multiple columns
As in pandas, you can extract a dataframe with a subset of columns by indexing it with a list of column names:
>>> gr[ ['Chromosome', 'Start'] ].head()
Chromosome Start
0 chr8 28510032
1 chr7 107153363
2 chr5 135821802
3 chr14 19418999
4 chr12 106679761
When the resulting dataframe contains all the required genomic location columns (Chromosome, Start, End), a PyRanges is returned:
>>> gr[ ['Chromosome', 'Start', 'End', 'Name'] ].head()
index | Chromosome Start End Name
int64 | category int64 int64 str
------- --- ------------ --------- --------- ------
0 | chr8 28510032 28510057 U0
1 | chr7 107153363 107153388 U0
2 | chr5 135821802 135821827 U0
3 | chr14 19418999 19419024 U0
4 | chr12 106679761 106679786 U0
PyRanges with 5 rows, 4 columns, and 1 index columns.
Contains 5 chromosomes.
The method get_with_loc_columns is a shortcut to extract
one or more columns together with the genomic location columns:
>>> gr.get_with_loc_columns('Name').head()
index | Chromosome Start End Strand Name
int64 | category int64 int64 category str
------- --- ------------ --------- --------- ---------- ------
0 | chr8 28510032 28510057 - U0
1 | chr7 107153363 107153388 - U0
2 | chr5 135821802 135821827 - U0
3 | chr14 19418999 19419024 - U0
4 | chr12 106679761 106679786 - U0
PyRanges with 5 rows, 5 columns, and 1 index columns.
Contains 5 chromosomes and 1 strands.
>>> gr.get_with_loc_columns(['Name', 'Score']).head()
index | Chromosome Start End Strand Name Score
int64 | category int64 int64 category str int64
------- --- ------------ --------- --------- ---------- ------ -------
0 | chr8 28510032 28510057 - U0 0
1 | chr7 107153363 107153388 - U0 0
2 | chr5 135821802 135821827 - U0 0
3 | chr14 19418999 19419024 - U0 0
4 | chr12 106679761 106679786 - U0 0
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 5 chromosomes and 1 strands.
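The column order in the outputs above (location columns first, then the requested ones) can be imitated in plain pandas. The sketch below is only an approximation under stated assumptions: with_loc_columns is a hypothetical helper, not part of pyranges1, and it assumes the location columns are Chromosome, Start, End and, when present, Strand, as the outputs above suggest.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Chromosome": ["chr8", "chr7"],
        "Start": [28510032, 107153363],
        "End": [28510057, 107153388],
        "Name": ["U0", "U0"],
        "Score": [0, 0],
        "Strand": ["-", "-"],
    }
)


def with_loc_columns(frame, names):
    """Hypothetical helper: location columns first, then the requested ones."""
    if isinstance(names, str):
        names = [names]
    # Assumed location columns; Strand is included only if the frame has it.
    loc = [c for c in ("Chromosome", "Start", "End", "Strand") if c in frame.columns]
    return frame[loc + [n for n in names if n not in loc]]


subset = with_loc_columns(df, ["Name", "Score"])
single = with_loc_columns(df, "Name")
```

Unlike the real method, this helper returns an ordinary pandas DataFrame.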
Dropping columns
Alternatively, you can specify which columns to remove with the pandas dataframe drop method.
Again, a PyRanges object is returned only if the genomic location columns are retained:
>>> gr.drop('Name', axis=1)
index | Chromosome Start End Score Strand newchr
int64 | category int64 int64 int64 category str
------- --- ------------ --------- --------- ------- ---------- --------
0 | chr8 28510032 28510057 0 - 8
1 | chr7 107153363 107153388 0 - 7
2 | chr5 135821802 135821827 0 - 5
3 | chr14 19418999 19419024 0 - 14
... | ... ... ... ... ... ...
16 | chr9 120803448 120803473 0 + 9
17 | chr6 89296757 89296782 0 - 6
18 | chr1 194245558 194245583 0 + 1
19 | chr8 57916061 57916086 0 + 8
PyRanges with 20 rows, 6 columns, and 1 index columns.
Contains 15 chromosomes and 2 strands.
>>> gr.drop(['Name', 'Chromosome', 'newchr'], axis=1).head()
Start End Score Strand
0 28510032 28510057 0 -
1 107153363 107153388 0 -
2 135821802 135821827 0 -
3 19418999 19419024 0 -
4 106679761 106679786 0 -
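The rule illustrated by the two calls above can be written as a simple check on the remaining column names. The sketch below uses plain pandas: keeps_location is a hypothetical helper introduced here for illustration, not how PyRanges decides its return type. It also uses the columns= keyword of drop, which in pandas is equivalent to passing axis=1.

```python
import pandas as pd

# Required genomic location columns, as named in the text.
LOC_COLUMNS = {"Chromosome", "Start", "End"}


def keeps_location(frame):
    """Hypothetical helper: True if all required location columns are present."""
    return LOC_COLUMNS.issubset(frame.columns)


df = pd.DataFrame(
    {
        "Chromosome": ["chr8", "chr7"],
        "Start": [28510032, 107153363],
        "End": [28510057, 107153388],
        "Name": ["U0", "U0"],
    }
)

# Dropping a non-location column leaves the location columns intact.
without_name = df.drop(columns="Name")

# Dropping Chromosome removes a required column.
without_chromosome = df.drop(columns=["Name", "Chromosome"])

name_dropped_ok = keeps_location(without_name)
chromosome_dropped_ok = keeps_location(without_chromosome)
```

Here name_dropped_ok is True and chromosome_dropped_ok is False, mirroring the PyRanges and plain dataframe results shown above.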
The PyRanges method remove_strand is a shortcut to remove the Strand column:
>>> gr.remove_strand().head()
index | Chromosome Start End Name Score newchr
int64 | category int64 int64 str int64 str
------- --- ------------ --------- --------- ------ ------- --------
0 | chr8 28510032 28510057 U0 0 8
1 | chr7 107153363 107153388 U0 0 7
2 | chr5 135821802 135821827 U0 0 5
3 | chr14 19418999 19419024 U0 0 14
4 | chr12 106679761 106679786 U0 0 12
PyRanges with 5 rows, 6 columns, and 1 index columns.
Contains 5 chromosomes.