pyranges.readers
Module Contents
Functions
rename_core_attrs | Deduplicate columns from GTF attributes that share names with the default 8 columns.
read_bed | Return bed file as PyRanges.
read_bam | Return bam file as PyRanges.
_fetch_gene_transcript_exon_id |
skiprows |
read_gtf | Read files in the Gene Transfer Format.
read_gtf_full |
parse_kv_fields |
to_rows |
to_rows_keep_duplicates |
read_gtf_restricted | seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
to_rows_gff3 |
read_gff3 | Read files in the General Feature Format.
read_bigwig |
- pyranges.readers.rename_core_attrs(df, ftype, rename_attr=False)
Deduplicate columns from GTF attributes that share names with the default 8 columns by appending “_attr” to each name if rename_attr==True. Otherwise, raise an error informing the user of formatting issues.
- Parameters:
df (pandas DataFrame) – DataFrame from read_gtf
ftype (str) – {‘gtf’ or ‘gff3’}
rename_attr (bool, default False) – Whether to rename attributes whose names clash with reserved column names by appending the suffix ‘_attr’, or to raise an error (default).
- Returns:
df – DataFrame with deduplicated column names
- Return type:
pandas DataFrame
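The renaming rule described above can be sketched in plain Python. This is a minimal illustration, not the pyranges implementation: the helper name dedupe_attr_names and the CORE_COLUMNS list (taken from the column headers shown in the read_gtf example) are assumptions made for this sketch.

```python
# Hypothetical helper for illustration only; not part of pyranges.
# The eight core column names are assumed from the read_gtf example output.
CORE_COLUMNS = ["Chromosome", "Source", "Feature", "Start",
                "End", "Score", "Strand", "Frame"]


def dedupe_attr_names(core, attrs, rename_attr=False):
    """Return attribute names, suffixing those that clash with core names."""
    clashes = [name for name in attrs if name in core]
    if clashes and not rename_attr:
        # Mirrors the documented default: report the problem instead of renaming.
        raise ValueError(f"Attribute names clash with core columns: {clashes}")
    return [name + "_attr" if name in core else name for name in attrs]


renamed = dedupe_attr_names(CORE_COLUMNS, ["gene_id", "Score", "Source"],
                            rename_attr=True)
print(renamed)  # ['gene_id', 'Score_attr', 'Source_attr']
```

With rename_attr left at False, the same call raises ValueError, which corresponds to the documented default behaviour.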
- pyranges.readers.read_bed(f, as_df=False, nrows=None)
Return bed file as PyRanges.
This is a reader for files that follow the bed format. Such files can have 3 to 12 columns, which are named as follows:
Chromosome Start End Name Score Strand ThickStart ThickEnd ItemRGB BlockCount BlockSizes BlockStarts
- Parameters:
f (str) – Path to bed file
as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
nrows (int, default None) – Number of rows to return.
Notes
If you just want to create a PyRanges from a tab-delimited bed-like file, use pr.PyRanges(pandas.read_table(f)) instead.
Examples
>>> path = pr.get_example_path("aorta.bed")
>>> pr.read_bed(path, nrows=5)
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |      9939 |     10138 | H3K27me3   |         7 | +            |
| chr1         |      9953 |     10152 | H3K27me3   |         5 | +            |
| chr1         |      9916 |     10115 | H3K27me3   |         5 | -            |
| chr1         |      9951 |     10150 | H3K27me3   |         8 | -            |
| chr1         |      9978 |     10177 | H3K27me3   |         7 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 5 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

>>> pr.read_bed(path, as_df=True, nrows=5)
   Chromosome  Start    End      Name  Score  Strand
0        chr1   9916  10115  H3K27me3      5       -
1        chr1   9939  10138  H3K27me3      7       +
2        chr1   9951  10150  H3K27me3      8       -
3        chr1   9953  10152  H3K27me3      5       +
4        chr1   9978  10177  H3K27me3      7       -
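The column-naming rule for bed files (the first n of the twelve names are used for a file with n columns) can be sketched as follows. This is an illustration using only the standard library; name_bed_fields is a hypothetical helper and does not reflect how read_bed is implemented internally.

```python
# The twelve bed column names, in the order listed in the description above.
BED_COLUMNS = ("Chromosome Start End Name Score Strand ThickStart ThickEnd "
               "ItemRGB BlockCount BlockSizes BlockStarts").split()


def name_bed_fields(line):
    """Map the tab-separated fields of one bed line to their column names."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= len(BED_COLUMNS):
        raise ValueError(f"Expected 3 to 12 fields, got {len(fields)}")
    # zip stops at the shorter sequence, so only the first n names are used.
    return dict(zip(BED_COLUMNS, fields))


row = name_bed_fields("chr1\t9939\t10138\tH3K27me3\t7\t+")
print(list(row))  # ['Chromosome', 'Start', 'End', 'Name', 'Score', 'Strand']
```

Values stay as strings in this sketch; a real reader would also convert Start, End and Score to numeric types.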
- pyranges.readers.read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1540)
Return bam file as PyRanges.
- Parameters:
f (str) – Path to bam file
sparse (bool, default True) – Whether to return only a minimal set of columns (Chromosome, Start, End, Strand and Flag, as in the example below).
as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
mapq (int, default 0) – Minimum mapping quality score.
required_flag (int, default 0) – Flags which must be present for the interval to be read.
filter_flag (int, default 1540) – Ignore reads with any of these flags. The default, 1540, ignores reads that are unmapped, failed vendor or platform quality checks, or are PCR or optical duplicates.
Notes
This functionality requires the library bamread. It can be installed with pip install bamread or conda install -c bioconda bamread.
Examples
>>> path = pr.get_example_path("control.bam")
>>> pr.read_bam(path).sort()
+--------------+-----------+-----------+--------------+------------+
| Chromosome   | Start     | End       | Strand       | Flag       |
| (category)   | (int64)   | (int64)   | (category)   | (uint16)   |
|--------------+-----------+-----------+--------------+------------|
| chr1         | 1041102   | 1041127   | +            | 0          |
| chr1         | 2129359   | 2129384   | +            | 0          |
| chr1         | 2239108   | 2239133   | +            | 0          |
| chr1         | 2318805   | 2318830   | +            | 0          |
| ...          | ...       | ...       | ...          | ...        |
| chrY         | 10632456  | 10632481  | -            | 16         |
| chrY         | 11918814  | 11918839  | -            | 16         |
| chrY         | 11936866  | 11936891  | -            | 16         |
| chrY         | 57402214  | 57402239  | -            | 16         |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 10,000 rows and 5 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
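To see where the default filter_flag of 1540 comes from, the sketch below combines the three SAM flag bits named in the parameter description. The keep_read helper is hypothetical; it assumes that required_flag means "all of these bits must be set" and filter_flag means "none of these bits may be set", which is the usual convention but is not confirmed here for the underlying bamread library.

```python
# SAM flag bits, as defined in the SAM format specification.
UNMAPPED = 0x4      # read is unmapped
QC_FAIL = 0x200     # read failed vendor or platform quality checks
DUPLICATE = 0x400   # read is a PCR or optical duplicate

DEFAULT_FILTER = UNMAPPED | QC_FAIL | DUPLICATE
print(DEFAULT_FILTER)  # 1540


def keep_read(flag, required_flag=0, filter_flag=DEFAULT_FILTER):
    """Assumed semantics: keep a read if it has every required bit
    and none of the filtered bits."""
    has_required = (flag & required_flag) == required_flag
    has_filtered = (flag & filter_flag) != 0
    return has_required and not has_filtered


print(keep_read(16))         # reverse-strand read, no filtered bits -> True
print(keep_read(16 | 0x400)) # duplicate on the reverse strand -> False
```

The Flag values 0 and 16 in the example output above both pass this check, since neither shares a bit with 1540.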
- pyranges.readers._fetch_gene_transcript_exon_id(attribute, annotation=None)
- pyranges.readers.skiprows(f)
- pyranges.readers.read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False)
Read files in the Gene Transfer Format.
- Parameters:
f (str) – Path to GTF file.
full (bool, default True) – Whether to read and interpret the annotation column.
as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
nrows (int, default None) – Number of rows to read. Default None, i.e. all.
duplicate_attr (bool, default False) – Whether to handle (potential) duplicate attributes or just keep the last one.
rename_attr (bool, default False) – Whether to rename attributes whose names clash with reserved column names by appending the suffix ‘_attr’, or to raise an error (default).
ignore_bad (bool, default False) – Whether to ignore bad lines or raise an error.
Note
The GTF format encodes both Start and End as 1-based and inclusive. PyRanges (and the DataFrame returned by this function if as_df=True) instead encodes intervals as 0-based, with Start included and End excluded.
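The coordinate change amounts to subtracting one from Start and leaving End unchanged. The sketch below shows the arithmetic with two hypothetical helper functions; the feature coordinates are illustrative values.

```python
def gtf_to_pyranges_coords(start, end):
    """1-based inclusive (GTF) -> 0-based, Start included, End excluded."""
    return start - 1, end


def pyranges_to_gtf_coords(start, end):
    """0-based, Start included, End excluded -> 1-based inclusive (GTF)."""
    return start + 1, end


# A feature written in a GTF file as 11869..14409 ...
start, end = gtf_to_pyranges_coords(11869, 14409)
print(start, end)   # 11868 14409
# ... keeps its length: End - Start in the 0-based form equals
# End - Start + 1 in the 1-based inclusive form.
print(end - start)  # 2541
```

Because End is excluded in the 0-based form, the interval length is simply End - Start, with no "+ 1" correction.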
See also
pyranges.read_gff3
read files in the General Feature Format
Examples
>>> path = pr.get_example_path("ensembl.gtf")
>>> gr = pr.read_gtf(path)
>>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
>>> # | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | +18   |
>>> # | (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)        | (object)       | ...   |
>>> # |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------|
>>> # | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...   |
>>> # | 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
>>> # Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
>>> # For printing, the PyRanges was sorted on Chromosome and Strand.
>>> # 18 hidden columns: gene_name, gene_source, gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, ... (+ 8 more.)
- pyranges.readers.read_gtf_full(f, as_df=False, nrows=None, skiprows=0, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False, chunksize: int = int(100000.0))
- pyranges.readers.parse_kv_fields(line)
- pyranges.readers.to_rows(anno, ignore_bad: bool = False)
- pyranges.readers.to_rows_keep_duplicates(anno, ignore_bad: bool = False)
- pyranges.readers.read_gtf_restricted(f, skiprows, as_df=False, nrows=None)
seqname - name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
# source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
# frame - One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
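The attribute column described above can be split into tag-value pairs with a few lines of Python. This is a simplified sketch, assuming the common GTF layout of tag "value"; pairs; parse_attribute is a hypothetical helper, not one of the pyranges functions, and it does not handle semicolons inside quoted values or repeated tags.

```python
def parse_attribute(attribute):
    """Split a GTF attribute string into a dict of tag -> value."""
    pairs = {}
    for chunk in attribute.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the trailing semicolon
        # The tag is everything before the first space; the rest is the value.
        tag, _, value = chunk.partition(" ")
        pairs[tag] = value.strip().strip('"')
    return pairs


attrs = parse_attribute('gene_id "ENSG00000223972"; gene_version "5";')
print(attrs)  # {'gene_id': 'ENSG00000223972', 'gene_version': '5'}
```

Because a later tag overwrites an earlier one with the same name, this sketch keeps only the last value of a duplicated attribute.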
- pyranges.readers.to_rows_gff3(anno)
- pyranges.readers.read_gff3(f, full=True, annotation=None, as_df=False, nrows=None)
Read files in the General Feature Format.
- Parameters:
f (str) – Path to GFF file.
full (bool, default True) – Whether to read and interpret the annotation column.
as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.
nrows (int, default None) – Number of rows to read. Default None, i.e. all.
Notes
The gff3 format encodes both Start and End as 1-based and inclusive. PyRanges (and the DataFrame returned by this function if as_df=True) instead encodes intervals as 0-based, with Start included and End excluded.
See also
pyranges.read_gtf
read files in the Gene Transfer Format
- pyranges.readers.read_bigwig(f, as_df=False)