:py:mod:`pyranges.readers`
==========================

.. py:module:: pyranges.readers


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   pyranges.readers.rename_core_attrs
   pyranges.readers.read_bed
   pyranges.readers.read_bam
   pyranges.readers._fetch_gene_transcript_exon_id
   pyranges.readers.skiprows
   pyranges.readers.read_gtf
   pyranges.readers.read_gtf_full
   pyranges.readers.parse_kv_fields
   pyranges.readers.to_rows
   pyranges.readers.to_rows_keep_duplicates
   pyranges.readers.read_gtf_restricted
   pyranges.readers.to_rows_gff3
   pyranges.readers.read_gff3
   pyranges.readers.read_bigwig


.. py:function:: rename_core_attrs(df, ftype, rename_attr=False)

   Deduplicate columns from GTF attributes that share names with the
   default 8 columns. If ``rename_attr`` is True, append ``"_attr"`` to each
   clashing name; otherwise raise an error informing the user of the
   formatting issue.

   :param df: DataFrame from read_gtf
   :type df: pandas DataFrame
   :param ftype: {'gtf' or 'gff3'}
   :type ftype: str
   :param rename_attr: Whether to rename (potential) attributes with reserved column names
                       with the suffix '_attr' or to just raise an error (default)
   :type rename_attr: bool, default False

   :returns: **df** -- DataFrame with deduplicated column names
   :rtype: pandas DataFrame
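
   The renaming rule can be illustrated with a minimal sketch on a plain list
   of column names. The helper ``dedupe_names`` and the error message are
   hypothetical and are not the pyranges implementation; the reserved names
   are the eight leading columns shown in the ``read_gtf`` example output.

   .. code-block:: python

      # Hypothetical illustration of the "_attr" renaming rule (not pyranges code).
      RESERVED = ["Chromosome", "Source", "Feature", "Start",
                  "End", "Score", "Strand", "Frame"]


      def dedupe_names(names, n_core=8, rename_attr=False):
          """Rename attribute columns that clash with the first n_core names."""
          core = names[:n_core]
          result = list(core)
          for name in names[n_core:]:
              if name in core:
                  if not rename_attr:
                      # Default behaviour: report the clash instead of renaming.
                      raise ValueError(
                          f"Attribute {name!r} clashes with a reserved column name"
                      )
                  name = name + "_attr"
              result.append(name)
          return result


      # "Source" appears once as a core column and once as an attribute.
      columns = RESERVED + ["gene_id", "Source"]
      renamed = dedupe_names(columns, rename_attr=True)
      print(renamed[-2:])  # ['gene_id', 'Source_attr']

   With ``rename_attr=False`` the same input raises ``ValueError`` in this
   sketch, mirroring the documented default of raising an error.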


.. py:function:: read_bed(f, as_df=False, nrows=None)

   Return bed file as PyRanges.

   This is a reader for files that follow the bed format. They can have from
   3 to 12 columns, which will be named like so:

   Chromosome Start End Name Score Strand ThickStart ThickEnd ItemRGB
   BlockCount BlockSizes BlockStarts

   :param f: Path to bed file
   :type f: str
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param nrows: Number of rows to return.
   :type nrows: int, default None

   .. rubric:: Notes

   If you just want to create a PyRanges from a tab-delimited bed-like file,
   use `pr.PyRanges(pandas.read_table(f))` instead.

   .. rubric:: Examples

   >>> path = pr.get_example_path("aorta.bed")
   >>> pr.read_bed(path, nrows=5)
   +--------------+-----------+-----------+------------+-----------+--------------+
   | Chromosome   |     Start |       End | Name       |     Score | Strand       |
   | (category)   |   (int64) |   (int64) | (object)   |   (int64) | (category)   |
   |--------------+-----------+-----------+------------+-----------+--------------|
   | chr1         |      9939 |     10138 | H3K27me3   |         7 | +            |
   | chr1         |      9953 |     10152 | H3K27me3   |         5 | +            |
   | chr1         |      9916 |     10115 | H3K27me3   |         5 | -            |
   | chr1         |      9951 |     10150 | H3K27me3   |         8 | -            |
   | chr1         |      9978 |     10177 | H3K27me3   |         7 | -            |
   +--------------+-----------+-----------+------------+-----------+--------------+
   Stranded PyRanges object has 5 rows and 6 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.

   >>> pr.read_bed(path, as_df=True, nrows=5)
     Chromosome  Start    End      Name  Score Strand
   0       chr1   9916  10115  H3K27me3      5      -
   1       chr1   9939  10138  H3K27me3      7      +
   2       chr1   9951  10150  H3K27me3      8      -
   3       chr1   9953  10152  H3K27me3      5      +
   4       chr1   9978  10177  H3K27me3      7      -
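
   How the column names follow from the number of fields can be sketched
   without pyranges. The helper ``bed_column_names`` is hypothetical and only
   illustrates the naming scheme listed above; it is not how ``read_bed`` is
   implemented.

   .. code-block:: python

      # Hypothetical illustration of the bed column naming scheme (not pyranges code).
      BED_COLUMNS = (
          "Chromosome Start End Name Score Strand ThickStart ThickEnd ItemRGB "
          "BlockCount BlockSizes BlockStarts"
      ).split()


      def bed_column_names(line):
          """Return the column names for one tab-delimited bed line."""
          fields = line.rstrip("\n").split("\t")
          if not 3 <= len(fields) <= len(BED_COLUMNS):
              raise ValueError(f"Expected 3 to 12 columns, got {len(fields)}")
          # The first n names are used for a file with n columns.
          return BED_COLUMNS[:len(fields)]


      names = bed_column_names("chr1\t9939\t10138\tH3K27me3\t7\t+")
      print(names)
      # ['Chromosome', 'Start', 'End', 'Name', 'Score', 'Strand']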


.. py:function:: read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1540)

   Return bam file as PyRanges.

   :param f: Path to bam file
   :type f: str
   :param sparse: Whether to return only a minimal set of columns (Chromosome, Start, End,
                  Strand and Flag, as in the example below).
   :type sparse: bool, default True
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param mapq: Minimum mapping quality score.
   :type mapq: int, default 0
   :param required_flag: Flags which must be present for the interval to be read.
   :type required_flag: int, default 0
   :param filter_flag: Ignore reads with these flags. Default 1540, which means that either
                       the read is unmapped, the read failed vendor or platform quality
                       checks, or the read is a PCR or optical duplicate.
   :type filter_flag: int, default 1540

   .. rubric:: Notes

   This functionality requires the library `bamread`. It can be installed with
   `pip install bamread` or `conda install -c bioconda bamread`.

   .. rubric:: Examples

   >>> path = pr.get_example_path("control.bam")
   >>> pr.read_bam(path).sort()
   +--------------+-----------+-----------+--------------+------------+
   | Chromosome   | Start     | End       | Strand       | Flag       |
   | (category)   | (int64)   | (int64)   | (category)   | (uint16)   |
   |--------------+-----------+-----------+--------------+------------|
   | chr1         | 1041102   | 1041127   | +            | 0          |
   | chr1         | 2129359   | 2129384   | +            | 0          |
   | chr1         | 2239108   | 2239133   | +            | 0          |
   | chr1         | 2318805   | 2318830   | +            | 0          |
   | ...          | ...       | ...       | ...          | ...        |
   | chrY         | 10632456  | 10632481  | -            | 16         |
   | chrY         | 11918814  | 11918839  | -            | 16         |
   | chrY         | 11936866  | 11936891  | -            | 16         |
   | chrY         | 57402214  | 57402239  | -            | 16         |
   +--------------+-----------+-----------+--------------+------------+
   Stranded PyRanges object has 10,000 rows and 5 columns from 25 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.
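
   The default ``filter_flag`` of 1540 is the sum of three SAM flag bits:
   4 (read unmapped), 512 (failed quality checks) and 1024 (PCR or optical
   duplicate). The sketch below shows how such flags are typically tested
   with bitwise operations. The helper ``keep_read`` is hypothetical, and
   its exact semantics (all required bits must be set, any filter bit
   excludes the read) are an assumption modelled on common SAM tooling, not
   taken from ``bamread``.

   .. code-block:: python

      # Hypothetical illustration of SAM flag filtering (not bamread code).
      UNMAPPED = 0x4      # 4: read is unmapped
      QC_FAIL = 0x200     # 512: read failed vendor or platform quality checks
      DUPLICATE = 0x400   # 1024: read is a PCR or optical duplicate

      DEFAULT_FILTER = UNMAPPED | QC_FAIL | DUPLICATE
      print(DEFAULT_FILTER)  # 1540


      def keep_read(flag, required_flag=0, filter_flag=DEFAULT_FILTER):
          """Assumed rule: keep reads with all required bits and no filtered bit."""
          has_required = (flag & required_flag) == required_flag
          has_filtered = (flag & filter_flag) != 0
          return has_required and not has_filtered


      print(keep_read(16))    # True: reverse-strand read, no filtered bit set
      print(keep_read(1024))  # False: marked as duplicate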


.. py:function:: _fetch_gene_transcript_exon_id(attribute, annotation=None)


.. py:function:: skiprows(f)


.. py:function:: read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False)

   Read files in the Gene Transfer Format.

   :param f: Path to GTF file.
   :type f: str
   :param full: Whether to read and interpret the annotation column.
   :type full: bool, default True
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param nrows: Number of rows to read. Default None, i.e. all.
   :type nrows: int, default None
   :param duplicate_attr: Whether to handle (potential) duplicate attributes or just keep last one.
   :type duplicate_attr: bool, default False
   :param rename_attr: Whether to rename (potential) attributes with reserved column names
                       with the suffix '_attr' or to just raise an error (default)
   :type rename_attr: bool, default False
   :param ignore_bad: Whether to ignore bad lines or raise an error.
   :type ignore_bad: bool, default False

   .. note::

      The GTF format encodes both Start and End as 1-based and inclusive.
      PyRanges (and the DataFrame returned by this function if ``as_df=True``)
      instead encodes intervals as 0-based, with Start included and End excluded.

   .. seealso::

      :obj:`pyranges.read_gff3`
          read files in the General Feature Format

   .. rubric:: Examples

   >>> path = pr.get_example_path("ensembl.gtf")
   >>> gr = pr.read_gtf(path)

   >>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
   >>> # | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | +18   |
   >>> # | (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)        | (object)       | ...   |
   >>> # |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------|
   >>> # | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...   |
   >>> # | 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # | 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # | 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # | 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
   >>> # Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
   >>> # For printing, the PyRanges was sorted on Chromosome and Strand.
   >>> # 18 hidden columns: gene_name, gene_source, gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, ... (+ 8 more.)
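
   The coordinate conversion described in the note above amounts to
   subtracting one from the start and leaving the end unchanged. The helper
   below is a hypothetical illustration, not part of pyranges; the input
   values assume that the first gene of the example is written in the file
   with a start of 11869 and an end of 14409.

   .. code-block:: python

      # Hypothetical illustration of the GTF -> PyRanges coordinate change.
      def gtf_to_half_open(start, end):
          """Convert a 1-based inclusive interval to 0-based, end-exclusive."""
          # Only the start shifts: the inclusive 1-based end equals the
          # exclusive 0-based end.
          return start - 1, end


      start, end = gtf_to_half_open(11869, 14409)
      print(start, end)  # 11868 14409

      # The interval length is unchanged by the conversion.
      print(end - start)  # 2541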


.. py:function:: read_gtf_full(f, as_df=False, nrows=None, skiprows=0, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False, chunksize: int = int(100000.0))


.. py:function:: parse_kv_fields(line)


.. py:function:: to_rows(anno, ignore_bad: bool = False)


.. py:function:: to_rows_keep_duplicates(anno, ignore_bad: bool = False)


.. py:function:: read_gtf_restricted(f, skiprows, as_df=False, nrows=None)

   The GTF fields are:

   - ``seqname``: name of the chromosome or scaffold; chromosome names can be
     given with or without the 'chr' prefix. Important note: the seqname must
     be one used within Ensembl, i.e. a standard chromosome name or an Ensembl
     identifier such as a scaffold ID, without any additional content such as
     species or assembly.
   - ``source``: name of the program that generated this feature, or the data
     source (database or project name).
   - ``feature``: feature type name, e.g. Gene, Variation, Similarity.
   - ``start``: start position of the feature, with sequence numbering
     starting at 1.
   - ``end``: end position of the feature, with sequence numbering starting
     at 1.
   - ``score``: a floating point value.
   - ``strand``: defined as + (forward) or - (reverse).
   - ``frame``: one of '0', '1' or '2'. '0' indicates that the first base of
     the feature is the first base of a codon, '1' that the second base is the
     first base of a codon, and so on.
   - ``attribute``: a semicolon-separated list of tag-value pairs, providing
     additional information about each feature.

   ``source`` and ``frame`` are prefixed with ``#`` in the docstring, which
   suggests that this restricted reader does not load them.
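
   The attribute column, a semicolon-separated list of tag-value pairs, can
   be split with a few lines of plain Python. This is a naive sketch with a
   hypothetical helper ``parse_attributes``; it is not the parser pyranges
   uses, it assumes values never contain semicolons, and it keeps only the
   last value when a tag is repeated.

   .. code-block:: python

      # Hypothetical, naive GTF attribute parser (not pyranges code).
      def parse_attributes(attribute):
          """Split 'tag "value"; tag "value";' into a dict of strings."""
          pairs = {}
          for field in attribute.strip().split(";"):
              field = field.strip()
              if not field:
                  continue  # skip the empty piece after the final semicolon
              key, _, value = field.partition(" ")
              # Drop surrounding whitespace and double quotes from the value.
              pairs[key] = value.strip().strip('"')
          return pairs


      parsed = parse_attributes('gene_id "ENSG00000223972"; gene_version "5";')
      print(parsed)
      # {'gene_id': 'ENSG00000223972', 'gene_version': '5'}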


.. py:function:: to_rows_gff3(anno)


.. py:function:: read_gff3(f, full=True, annotation=None, as_df=False, nrows=None)

   Read files in the General Feature Format.

   :param f: Path to GFF file.
   :type f: str
   :param full: Whether to read and interpret the annotation column.
   :type full: bool, default True
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param nrows: Number of rows to read. Default None, i.e. all.
   :type nrows: int, default None

   .. rubric:: Notes

   The gff3 format encodes both Start and End as 1-based and inclusive.
   PyRanges (and the DataFrame returned by this function if ``as_df=True``)
   instead encodes intervals as 0-based, with Start included and End excluded.

   .. seealso::

      :obj:`pyranges.read_gtf`
          read files in the Gene Transfer Format


.. py:function:: read_bigwig(f, as_df=False)