:py:mod:`pyranges.readers`
==========================

.. py:module:: pyranges.readers


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   pyranges.readers.rename_core_attrs
   pyranges.readers.read_bed
   pyranges.readers.read_bam
   pyranges.readers._fetch_gene_transcript_exon_id
   pyranges.readers.skiprows
   pyranges.readers.read_gtf
   pyranges.readers.read_gtf_full
   pyranges.readers.parse_kv_fields
   pyranges.readers.to_rows
   pyranges.readers.to_rows_keep_duplicates
   pyranges.readers.read_gtf_restricted
   pyranges.readers.to_rows_gff3
   pyranges.readers.read_gff3
   pyranges.readers.read_bigwig


.. py:function:: rename_core_attrs(df, ftype, rename_attr=False)

   Deduplicate columns from GTF attributes that share names with the default
   8 columns by appending "_attr" to each name if ``rename_attr`` is True.
   Otherwise, raise an error informing the user of the formatting issue.

   :param df: DataFrame from read_gtf
   :type df: pandas DataFrame
   :param ftype: File type, either 'gtf' or 'gff3'.
   :type ftype: str
   :param rename_attr: Whether to rename (potential) attributes with reserved column names
                       with the suffix '_attr' or to just raise an error (default)
   :type rename_attr: bool, default False

   :returns: **df** -- DataFrame with deduplicated column names
   :rtype: pandas DataFrame


.. py:function:: read_bed(f, as_df=False, nrows=None)

   Return bed file as PyRanges.

   This is a reader for files that follow the bed format. They can have
   between 3 and 12 columns, which are named as follows:

   Chromosome Start End Name Score Strand ThickStart ThickEnd ItemRGB
   BlockCount BlockSizes BlockStarts

   :param f: Path to bed file
   :type f: str
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param nrows: Number of rows to return.
   :type nrows: int, default None

   .. rubric:: Notes

   If you just want to create a PyRanges from a tab-delimited bed-like file,
   use ``pr.PyRanges(pandas.read_table(f))`` instead.

   .. rubric:: Examples

   >>> path = pr.get_example_path("aorta.bed")
   >>> pr.read_bed(path, nrows=5)
   +--------------+-----------+-----------+------------+-----------+--------------+
   | Chromosome   | Start     | End       | Name       | Score     | Strand       |
   | (category)   | (int64)   | (int64)   | (object)   | (int64)   | (category)   |
   |--------------+-----------+-----------+------------+-----------+--------------|
   | chr1         | 9939      | 10138     | H3K27me3   | 7         | +            |
   | chr1         | 9953      | 10152     | H3K27me3   | 5         | +            |
   | chr1         | 9916      | 10115     | H3K27me3   | 5         | -            |
   | chr1         | 9951      | 10150     | H3K27me3   | 8         | -            |
   | chr1         | 9978      | 10177     | H3K27me3   | 7         | -            |
   +--------------+-----------+-----------+------------+-----------+--------------+
   Stranded PyRanges object has 5 rows and 6 columns from 1 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.

   >>> pr.read_bed(path, as_df=True, nrows=5)
     Chromosome  Start    End      Name  Score Strand
   0       chr1   9916  10115  H3K27me3      5      -
   1       chr1   9939  10138  H3K27me3      7      +
   2       chr1   9951  10150  H3K27me3      8      -
   3       chr1   9953  10152  H3K27me3      5      +
   4       chr1   9978  10177  H3K27me3      7      -


.. py:function:: read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1540)

   Return bam file as PyRanges.

   :param f: Path to bam file
   :type f: str
   :param sparse: Whether to return only a minimal set of columns. The example below,
                  run with the default, shows Chromosome, Start, End, Strand and Flag.
   :type sparse: bool, default True
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param mapq: Minimum mapping quality score.
   :type mapq: int, default 0
   :param required_flag: Flags which must be present for the interval to be read.
   :type required_flag: int, default 0
   :param filter_flag: Ignore reads with these flags. The default, 1540, excludes reads that
                       are unmapped, failed vendor or platform quality checks, or are
                       PCR or optical duplicates.
   :type filter_flag: int, default 1540

   .. rubric:: Notes

   This functionality requires the library ``bamread``. It can be installed with
   ``pip install bamread`` or ``conda install -c bioconda bamread``.

   .. rubric:: Examples

   >>> path = pr.get_example_path("control.bam")
   >>> pr.read_bam(path).sort()
   +--------------+-----------+-----------+--------------+------------+
   | Chromosome   | Start     | End       | Strand       | Flag       |
   | (category)   | (int64)   | (int64)   | (category)   | (uint16)   |
   |--------------+-----------+-----------+--------------+------------|
   | chr1         | 1041102   | 1041127   | +            | 0          |
   | chr1         | 2129359   | 2129384   | +            | 0          |
   | chr1         | 2239108   | 2239133   | +            | 0          |
   | chr1         | 2318805   | 2318830   | +            | 0          |
   | ...          | ...       | ...       | ...          | ...        |
   | chrY         | 10632456  | 10632481  | -            | 16         |
   | chrY         | 11918814  | 11918839  | -            | 16         |
   | chrY         | 11936866  | 11936891  | -            | 16         |
   | chrY         | 57402214  | 57402239  | -            | 16         |
   +--------------+-----------+-----------+--------------+------------+
   Stranded PyRanges object has 10,000 rows and 5 columns from 25 chromosomes.
   For printing, the PyRanges was sorted on Chromosome and Strand.


.. py:function:: _fetch_gene_transcript_exon_id(attribute, annotation=None)


.. py:function:: skiprows(f)


.. py:function:: read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False)

   Read files in the Gene Transfer Format.

   :param f: Path to GTF file.
   :type f: str
   :param full: Whether to read and interpret the annotation column.
   :type full: bool, default True
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param nrows: Number of rows to read. Default None, i.e. all.
   :type nrows: int, default None
   :param duplicate_attr: Whether to handle (potentially) duplicated attributes or to keep only the last one.
   :type duplicate_attr: bool, default False
   :param rename_attr: Whether to rename (potential) attributes with reserved column names
                       with the suffix '_attr' or to just raise an error (default)
   :type rename_attr: bool, default False
   :param ignore_bad: Whether to ignore bad lines or raise an error.
   :type ignore_bad: bool, default False

   .. note::

      The GTF format encodes both Start and End as 1-based, inclusive positions.
      PyRanges (and the DataFrame returned by this function if as_df=True)
      instead encodes intervals as 0-based, with Start included and End excluded.

   .. seealso::

      :obj:`pyranges.read_gff3`
          read files in the General Feature Format

   .. rubric:: Examples

   >>> path = pr.get_example_path("ensembl.gtf")
   >>> gr = pr.read_gtf(path)
   >>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
   >>> # | Chromosome   | Source     | Feature      | Start     | End       | Score      | Strand       | Frame      | gene_id         | gene_version   | +18   |
   >>> # | (category)   | (object)   | (category)   | (int64)   | (int64)   | (object)   | (category)   | (object)   | (object)        | (object)       | ...   |
   >>> # |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------|
   >>> # | 1            | havana     | gene         | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | 1            | havana     | transcript   | 11868     | 14409     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | 1            | havana     | exon         | 11868     | 12227     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | 1            | havana     | exon         | 12612     | 12721     | .          | +            | .          | ENSG00000223972 | 5              | ...   |
   >>> # | ...          | ...        | ...          | ...       | ...       | ...        | ...          | ...        | ...             | ...            | ...   |
   >>> # | 1            | ensembl    | transcript   | 120724    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # | 1            | ensembl    | exon         | 133373    | 133723    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # | 1            | ensembl    | exon         | 129054    | 129223    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # | 1            | ensembl    | exon         | 120873    | 120932    | .          | -            | .          | ENSG00000238009 | 6              | ...   |
   >>> # +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------+
   >>> # Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
   >>> # For printing, the PyRanges was sorted on Chromosome and Strand.
   >>> # 18 hidden columns: gene_name, gene_source, gene_biotype, transcript_id, transcript_version, transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, ... (+ 8 more.)


.. py:function:: read_gtf_full(f, as_df=False, nrows=None, skiprows=0, duplicate_attr=False, rename_attr=False, ignore_bad: bool = False, chunksize: int = int(100000.0))


.. py:function:: parse_kv_fields(line)


.. py:function:: to_rows(anno, ignore_bad: bool = False)


.. py:function:: to_rows_keep_duplicates(anno, ignore_bad: bool = False)


.. py:function:: read_gtf_restricted(f, skiprows, as_df=False, nrows=None)

   | seqname - name of the chromosome or scaffold; chromosome names can be given
     with or without the 'chr' prefix. Important note: the seqname must be one
     used within Ensembl, i.e. a standard chromosome name or an Ensembl
     identifier such as a scaffold ID, without any additional content such as
     species or assembly. See the example GFF output below.
   | # source - name of the program that generated this feature, or the data
     source (database or project name)
   | feature - feature type name, e.g. Gene, Variation, Similarity
   | start - Start position of the feature, with sequence numbering starting at 1.
   | end - End position of the feature, with sequence numbering starting at 1.
   | score - A floating point value.
   | strand - defined as + (forward) or - (reverse).
   | # frame - One of '0', '1' or '2'. '0' indicates that the first base of the
     feature is the first base of a codon, '1' that the second base is the
     first base of a codon, and so on.
   | attribute - A semicolon-separated list of tag-value pairs, providing
     additional information about each feature.


.. py:function:: to_rows_gff3(anno)


.. py:function:: read_gff3(f, full=True, annotation=None, as_df=False, nrows=None)

   Read files in the General Feature Format.

   :param f: Path to GFF file.
   :type f: str
   :param full: Whether to read and interpret the annotation column.
   :type full: bool, default True
   :param as_df: Whether to return as pandas DataFrame instead of PyRanges.
   :type as_df: bool, default False
   :param nrows: Number of rows to read. Default None, i.e. all.
   :type nrows: int, default None

   .. rubric:: Notes

   The GFF3 format encodes both Start and End as 1-based, inclusive positions.
   PyRanges (and the DataFrame returned by this function if as_df=True)
   instead encodes intervals as 0-based, with Start included and End excluded.

   .. seealso::

      :obj:`pyranges.read_gtf`
          read files in the Gene Transfer Format


.. py:function:: read_bigwig(f, as_df=False)
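
The notes for ``read_gtf`` and ``read_gff3`` above state that the file formats use 1-based coordinates with both ends included, while PyRanges stores 0-based intervals with End excluded. A minimal sketch of that conversion with plain pandas follows. It is not the library's implementation: the helper name ``to_half_open`` is hypothetical, and the input values are illustrative, chosen so that the result lines up with the first gene and one exon in the ``read_gtf`` example table.

.. code-block:: python

   import pandas as pd


   def to_half_open(df: pd.DataFrame) -> pd.DataFrame:
       """Convert 1-based inclusive Start/End to 0-based Start, exclusive End.

       Hypothetical helper for illustration only; not part of pyranges.
       """
       out = df.copy()
       # Shifting Start down by one moves it to 0-based numbering.
       out["Start"] = out["Start"] - 1
       # End is left unchanged: a 1-based inclusive end and a 0-based
       # exclusive end refer to the same boundary.
       return out


   # Illustrative coordinates as they would appear in a GTF/GFF3 file.
   file_coords = pd.DataFrame(
       {
           "Chromosome": ["1", "1"],
           "Start": [11869, 12613],
           "End": [14409, 12721],
       }
   )

   converted = to_half_open(file_coords)
   # Interval length is preserved: End - Start (half-open)
   # equals End - Start + 1 (inclusive).
   lengths = (converted["End"] - converted["Start"]).tolist()

Only Start changes in this sketch; the interval lengths computed from the half-open form match the inclusive lengths of the input.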