pyranges.readers

Module Contents

Functions

rename_core_attrs(df, ftype[, rename_attr])

Deduplicate columns from GTF attributes that share names

read_bed(f[, as_df, nrows])

Return bed file as PyRanges.

read_bam(f[, sparse, as_df, mapq, required_flag, ...])

Return bam file as PyRanges.

_fetch_gene_transcript_exon_id(attribute[, annotation])

skiprows(f)

read_gtf(f[, full, as_df, nrows, duplicate_attr, ...])

Read files in the Gene Transfer Format.

read_gtf_full(f[, as_df, nrows, skiprows, ...])

parse_kv_fields(line)

to_rows(anno[, ignore_bad])

to_rows_keep_duplicates(anno[, ignore_bad])

read_gtf_restricted(f, skiprows[, as_df, nrows])

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.

to_rows_gff3(anno)

read_gff3(f[, full, annotation, as_df, nrows])

Read files in the General Feature Format.

read_bigwig(f[, as_df])

pyranges.readers.rename_core_attrs(df, ftype, rename_attr=False)

Deduplicate columns from GTF attributes that share names with the default 8 columns. If rename_attr is True, “_attr” is appended to each conflicting name; otherwise an error is raised informing the user of the formatting issue.

Parameters:
  • df (pandas DataFrame) – DataFrame from read_gtf

  • ftype (str) – {‘gtf’ or ‘gff3’}

  • rename_attr (bool, default False) – Whether to rename (potential) attributes with reserved column names with the suffix ‘_attr’ or to just raise an error (default)

Returns:

df – DataFrame with deduplicated column names

Return type:

pandas DataFrame
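
The renaming rule can be illustrated on a plain list of column names. This is a minimal sketch, not the library's implementation: the helper name dedupe_attr_columns is hypothetical, and the list of core column names is an assumption taken from the read_gtf example output.

```python
# Hypothetical helper illustrating the "_attr" renaming rule; not pyranges code.
# Core column names assumed from the read_gtf example output.
CORE_COLUMNS = [
    "Chromosome", "Source", "Feature", "Start",
    "End", "Score", "Strand", "Frame",
]


def dedupe_attr_columns(attr_names, rename_attr=False):
    """Return attribute column names that do not clash with CORE_COLUMNS."""
    result = []
    for name in attr_names:
        if name in CORE_COLUMNS:
            if not rename_attr:
                # Default behaviour described above: report the clash.
                raise ValueError(
                    f"Attribute {name!r} clashes with a core column name."
                )
            name = name + "_attr"
        result.append(name)
    return result


print(dedupe_attr_columns(["gene_id", "Start"], rename_attr=True))
# prints ['gene_id', 'Start_attr']
```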

pyranges.readers.read_bed(f, as_df=False, nrows=None)

Return bed file as PyRanges.

This is a reader for files that follow the bed format. Such files can have from 3 to 12 columns, which will be named as follows:

Chromosome Start End Name Score Strand ThickStart ThickEnd ItemRGB BlockCount BlockSizes BlockStarts
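
The naming scheme above amounts to taking the first n names from the list for a file with n columns. A minimal sketch, assuming a hypothetical helper bed_column_names that is not part of pyranges:

```python
# Column names from the list above, in BED order.
BED_COLUMNS = [
    "Chromosome", "Start", "End", "Name", "Score", "Strand",
    "ThickStart", "ThickEnd", "ItemRGB",
    "BlockCount", "BlockSizes", "BlockStarts",
]


def bed_column_names(n_columns):
    """Return the names for a bed file with n_columns columns (3 to 12)."""
    if not 3 <= n_columns <= len(BED_COLUMNS):
        raise ValueError("A bed file must have between 3 and 12 columns.")
    return BED_COLUMNS[:n_columns]


print(bed_column_names(6))
# prints ['Chromosome', 'Start', 'End', 'Name', 'Score', 'Strand']
```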

Parameters:
  • f (str) – Path to bed file

  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.

  • nrows (int, default None) – Number of rows to return.

Notes

If you just want to create a PyRanges from a tab-delimited bed-like file, use pr.PyRanges(pandas.read_table(f)) instead.

Examples

>>> path = pr.get_example_path("aorta.bed")
>>> pr.read_bed(path, nrows=5)
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   |     Start |       End | Name       |     Score | Strand       |
| (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         |      9939 |     10138 | H3K27me3   |         7 | +            |
| chr1         |      9953 |     10152 | H3K27me3   |         5 | +            |
| chr1         |      9916 |     10115 | H3K27me3   |         5 | -            |
| chr1         |      9951 |     10150 | H3K27me3   |         8 | -            |
| chr1         |      9978 |     10177 | H3K27me3   |         7 | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 5 rows and 6 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> pr.read_bed(path, as_df=True, nrows=5)
  Chromosome  Start    End      Name  Score Strand
0       chr1   9916  10115  H3K27me3      5      -
1       chr1   9939  10138  H3K27me3      7      +
2       chr1   9951  10150  H3K27me3      8      -
3       chr1   9953  10152  H3K27me3      5      +
4       chr1   9978  10177  H3K27me3      7      -
pyranges.readers.read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1540)

Return bam file as PyRanges.

Parameters:
  • f (str) – Path to bam file

  • sparse (bool, default True) – Whether to return only a minimal set of columns (Chromosome, Start, End, Strand and Flag, as in the example output).

  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.

  • mapq (int, default 0) – Minimum mapping quality score.

  • required_flag (int, default 0) – Flags which must be present for the interval to be read.

  • filter_flag (int, default 1540) – Ignore reads with these flags. The default, 1540, excludes reads that are unmapped, that failed vendor or platform quality checks, or that are PCR or optical duplicates.
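
The default filter_flag of 1540 is the sum of three SAM flag bits. The sketch below shows the arithmetic and how required_flag and filter_flag are conventionally combined (the same convention as samtools -f and -F). The helper passes_flags is hypothetical, and whether bamread applies exactly this test is an assumption.

```python
# SAM flag bits relevant to the default filter (values from the SAM specification).
FLAG_UNMAPPED = 0x4     # 4: read is unmapped
FLAG_QC_FAIL = 0x200    # 512: read failed vendor or platform quality checks
FLAG_DUPLICATE = 0x400  # 1024: read is a PCR or optical duplicate

DEFAULT_FILTER_FLAG = FLAG_UNMAPPED | FLAG_QC_FAIL | FLAG_DUPLICATE  # 1540


def passes_flags(flag, required_flag=0, filter_flag=DEFAULT_FILTER_FLAG):
    """Hypothetical check: keep a read only if it has every required bit
    and none of the filtered bits."""
    has_required = (flag & required_flag) == required_flag
    has_filtered = (flag & filter_flag) != 0
    return has_required and not has_filtered


print(DEFAULT_FILTER_FLAG)   # prints 1540
print(passes_flags(16))      # reverse-strand read, prints True
print(passes_flags(1040))    # 1024 + 16: duplicate, prints False
```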

Notes

This functionality requires the library bamread. It can be installed with pip install bamread or conda install -c bioconda bamread.

Examples

>>> path = pr.get_example_path("control.bam")
>>> pr.read_bam(path).sort()
+--------------+-----------+-----------+--------------+------------+
| Chromosome   | Start     | End       | Strand       | Flag       |
| (category)   | (int64)   | (int64)   | (category)   | (uint16)   |
|--------------+-----------+-----------+--------------+------------|
| chr1         | 1041102   | 1041127   | +            | 0          |
| chr1         | 2129359   | 2129384   | +            | 0          |
| chr1         | 2239108   | 2239133   | +            | 0          |
| chr1         | 2318805   | 2318830   | +            | 0          |
| ...          | ...       | ...       | ...          | ...        |
| chrY         | 10632456  | 10632481  | -            | 16         |
| chrY         | 11918814  | 11918839  | -            | 16         |
| chrY         | 11936866  | 11936891  | -            | 16         |
| chrY         | 57402214  | 57402239  | -            | 16         |
+--------------+-----------+-----------+--------------+------------+
Stranded PyRanges object has 10,000 rows and 5 columns from 25 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
pyranges.readers._fetch_gene_transcript_exon_id(attribute, annotation=None)
pyranges.readers.skiprows(f)
pyranges.readers.read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False)

Read files in the Gene Transfer Format.

Parameters:
  • f (str) – Path to GTF file.

  • full (bool, default True) – Whether to read and interpret the annotation column.

  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.

  • nrows (int, default None) – Number of rows to read. Default None, i.e. all.

  • duplicate_attr (bool, default False) – Whether to handle (potential) duplicate attributes or just keep last one.

  • rename_attr (bool, default False) – Whether to rename (potential) attributes with reserved column names with the suffix ‘_attr’ or to just raise an error (default)

  • ignore_bad (bool, default False) – Whether to ignore bad lines or raise an error.
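
The difference between keeping only the last value of a repeated attribute and keeping all of them can be sketched on a single attribute string. This is an illustration only: parse_attributes is a hypothetical helper, the tag values are made up, and joining duplicates with a comma is an assumption rather than documented pyranges behaviour.

```python
def parse_attributes(attribute, keep_duplicates=False):
    """Parse a GTF attribute string of the form 'key "value"; key "value";'.

    Simplified: assumes that values contain no semicolons.
    """
    parsed = {}
    for field in attribute.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the trailing semicolon
        key, _, value = field.partition(" ")
        value = value.strip().strip('"')
        if keep_duplicates and key in parsed:
            # Assumed strategy: join repeated values with a comma.
            parsed[key] = parsed[key] + "," + value
        else:
            # Keep-last strategy: a later value overwrites an earlier one.
            parsed[key] = value
    return parsed


attrs = 'gene_id "ENSG00000223972"; tag "basic"; tag "CCDS";'
print(parse_attributes(attrs))
# prints {'gene_id': 'ENSG00000223972', 'tag': 'CCDS'}
print(parse_attributes(attrs, keep_duplicates=True))
# prints {'gene_id': 'ENSG00000223972', 'tag': 'basic,CCDS'}
```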

Note

The GTF format encodes both Start and End as 1-based and inclusive. PyRanges (and the DataFrame returned by this function if as_df=True) instead encodes intervals as 0-based, with Start included and End excluded.
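
The conversion between the two conventions only shifts Start by one; End keeps its numeric value. A minimal sketch with hypothetical helper names, using coordinates chosen to match the first row of the example output:

```python
def gtf_to_pyranges_coords(start, end):
    """Convert a 1-based, inclusive interval to 0-based, end-exclusive."""
    return start - 1, end


def pyranges_to_gtf_coords(start, end):
    """Convert a 0-based, end-exclusive interval back to 1-based, inclusive."""
    return start + 1, end


print(gtf_to_pyranges_coords(11869, 14409))
# prints (11868, 14409)

# The interval length is the same in both conventions.
start, end = gtf_to_pyranges_coords(11869, 14409)
assert end - start == 14409 - 11869 + 1
```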

See also

pyranges.read_gff3

read files in the General Feature Format

Examples

>>> path = pr.get_example_path("ensembl.gtf")
>>> gr = pr.read_gtf(path)
>>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
>>> # | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | +18   |
>>> # | (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)        | (object)       | ...   |
>>> # |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------|
>>> # | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
>>> # | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...   |
>>> # | 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # | 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
>>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
>>> # Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
>>> # For printing, the PyRanges was sorted on Chromosome and Strand.
>>> # 18 hidden columns: gene_name, gene_source, gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, ... (+ 8 more.)
pyranges.readers.read_gtf_full(f, as_df=False, nrows=None, skiprows=0, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False, chunksize: int = int(100000.0))
pyranges.readers.parse_kv_fields(line)
pyranges.readers.to_rows(anno, ignore_bad: bool = False)
pyranges.readers.to_rows_keep_duplicates(anno, ignore_bad: bool = False)
pyranges.readers.read_gtf_restricted(f, skiprows, as_df=False, nrows=None)

  • seqname – name of the chromosome or scaffold; chromosome names can be given with or without the ‘chr’ prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.

  • source – name of the program that generated this feature, or the data source (database or project name)

  • feature – feature type name, e.g. Gene, Variation, Similarity

  • start – Start position of the feature, with sequence numbering starting at 1.

  • end – End position of the feature, with sequence numbering starting at 1.

  • score – A floating point value.

  • strand – defined as + (forward) or - (reverse).

  • frame – One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on.

  • attribute – A semicolon-separated list of tag-value pairs, providing additional information about each feature.
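
The nine fields described above are tab-separated on each line of the file. The following sketch splits one line into named fields. It is not the reader's implementation: split_gtf_line is a hypothetical helper, and the sample line is an illustrative reconstruction based on the read_gtf example output.

```python
GTF_FIELDS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def split_gtf_line(line):
    """Split one tab-separated GTF line into a dict of its nine fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GTF_FIELDS):
        raise ValueError(f"Expected 9 fields, found {len(values)}.")
    record = dict(zip(GTF_FIELDS, values))
    # start and end are 1-based positions stored as integers;
    # the remaining fields are kept as strings ('.' marks a missing value).
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Illustrative line, not copied from a real file.
line = '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_version "5";'
record = split_gtf_line(line)
print(record["seqname"], record["feature"], record["start"], record["end"])
# prints 1 gene 11869 14409
```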

pyranges.readers.to_rows_gff3(anno)
pyranges.readers.read_gff3(f, full=True, annotation=None, as_df=False, nrows=None)

Read files in the General Feature Format.

Parameters:
  • f (str) – Path to GFF file.

  • full (bool, default True) – Whether to read and interpret the annotation column.

  • as_df (bool, default False) – Whether to return as pandas DataFrame instead of PyRanges.

  • nrows (int, default None) – Number of rows to read. Default None, i.e. all.

Notes

The gff3 format encodes both Start and End as 1-based and inclusive. PyRanges (and the DataFrame returned by this function if as_df=True) instead encodes intervals as 0-based, with Start included and End excluded.

See also

pyranges.read_gtf

read files in the Gene Transfer Format

pyranges.readers.read_bigwig(f, as_df=False)