Rows operations
~~~~~~~~~~~~~~~

.. contents::
   :local:
   :depth: 2


Indexing with iloc, loc
-----------------------

PyRanges inherits all the indexing and slicing capabilities of pandas, e.g. boolean Series indexing,
``iloc``, ``loc``, ``at``, ``iat``.
Note that these methods return a view, not a copy, with the caveats that it implies.
See the pandas documentation for details.
Briefly, to avoid ambiguity it is best to explicitly call ``copy`` if you want an object to not be linked
to the original object from which it was extracted. For example:

  >>> import pyranges1 as pr
  >>> gr = pr.example_data.aorta
  >>> gr
  index    |    Chromosome    Start    End      Name      Score    Strand
  int64    |    category      int64    int64    str       int64    category
  -------  ---  ------------  -------  -------  --------  -------  ----------
  0        |    chr1          9916     10115    H3K27me3  5        -
  1        |    chr1          9939     10138    H3K27me3  7        +
  2        |    chr1          9951     10150    H3K27me3  8        -
  3        |    chr1          9953     10152    H3K27me3  5        +
  ...      |    ...           ...      ...      ...       ...      ...
  7        |    chr1          10127    10326    H3K27me3  1        -
  8        |    chr1          10241    10440    H3K27me3  6        -
  9        |    chr1          10246    10445    H3K27me3  4        +
  10       |    chr1          110246   110445   H3K27me3  1        +
  PyRanges with 11 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

  >>> sgr = gr.iloc[0:3].copy()
  >>> sgr
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        0  |    chr1             9916    10115  H3K27me3        5  -
        1  |    chr1             9939    10138  H3K27me3        7  +
        2  |    chr1             9951    10150  H3K27me3        8  -
  PyRanges with 3 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

  >>> sgr['Score'] = 100  # does not modify gr
  >>> gr.head(3)
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        0  |    chr1             9916    10115  H3K27me3        5  -
        1  |    chr1             9939    10138  H3K27me3        7  +
        2  |    chr1             9951    10150  H3K27me3        8  -
  PyRanges with 3 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

On the other hand, to modify even a few lines of a PyRanges object,
use ``loc`` (label-based indexing) or ``iloc`` (positional indexing) on the whole object, not on a view:

  >>> gr.loc[0:2, 'Score'] = 100  # modifies gr
  >>> gr.head(5)
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        0  |    chr1             9916    10115  H3K27me3      100  -
        1  |    chr1             9939    10138  H3K27me3      100  +
        2  |    chr1             9951    10150  H3K27me3      100  -
        3  |    chr1             9953    10152  H3K27me3        5  +
        4  |    chr1             9978    10177  H3K27me3        7  -
  PyRanges with 5 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

Using boolean indexers
-----------------------

In PyRanges, boolean indexers work analogously as in pandas, as already seen in the tutorial:

  >>> sel = (gr['Score'] == 100)
  >>> gr[sel]
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        0  |    chr1             9916    10115  H3K27me3      100  -
        1  |    chr1             9939    10138  H3K27me3      100  +
        2  |    chr1             9951    10150  H3K27me3      100  -
  PyRanges with 3 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

  >>> gr[gr.Score==5]
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        3  |    chr1             9953    10152  H3K27me3        5  +
        5  |    chr1            10001    10200  H3K27me3        5  -
  PyRanges with 2 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

Combined with ``loc``, boolean indexers can be used to add or update column values:

  >>> gr.loc[gr['Score'] < 6, 'Score'] = -10  # modifies gr
  >>> gr.head(5)
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        0  |    chr1             9916    10115  H3K27me3      100  -
        1  |    chr1             9939    10138  H3K27me3      100  +
        2  |    chr1             9951    10150  H3K27me3      100  -
        3  |    chr1             9953    10152  H3K27me3      -10  +
        4  |    chr1             9978    10177  H3K27me3        7  -
  PyRanges with 5 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

In pandas, these logical operators can be employed with boolean Series:

* "&" =  element-wise AND operator
* "|" = element-wise OR operator
* "~" = NOT operator, inverts the values of the Series on its right

When using logical operators, make sure to parenthesize properly.

Let's get the + intervals with Score<8 starting before 10,000 or ending after 100,000:

  >>> gr[ (gr.Score<8) & (gr.Strand=='+') &
  ...     ((gr.Start<10000) | (gr.End>100000)) ]
    index  |    Chromosome      Start      End  Name        Score  Strand
    int64  |    category        int64    int64  str         int64  category
  -------  ---  ------------  -------  -------  --------  -------  ----------
        3  |    chr1             9953    10152  H3K27me3      -10  +
       10  |    chr1           110246   110445  H3K27me3      -10  +
  PyRanges with 2 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 1 strands.

Let's invert the selection:
  >>> gr[~(
  ...      (gr.Score<8) & (gr.Strand=='+') &
  ...      ((gr.Start<10000) | (gr.End>100000)) )]
  index    |    Chromosome    Start    End      Name      Score    Strand
  int64    |    category      int64    int64    str       int64    category
  -------  ---  ------------  -------  -------  --------  -------  ----------
  0        |    chr1          9916     10115    H3K27me3  100      -
  1        |    chr1          9939     10138    H3K27me3  100      +
  2        |    chr1          9951     10150    H3K27me3  100      -
  4        |    chr1          9978     10177    H3K27me3  7        -
  ...      |    ...           ...      ...      ...       ...      ...
  6        |    chr1          10024    10223    H3K27me3  -10      +
  7        |    chr1          10127    10326    H3K27me3  -10      -
  8        |    chr1          10241    10440    H3K27me3  6        -
  9        |    chr1          10246    10445    H3K27me3  -10      +
  PyRanges with 9 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

Using PyRanges .loci
---------------------

PyRanges provides the method :func:`loci <pyranges1.PyRanges.loci>`
to select rows by genomic region:

  >>> gr2 = pr.example_data.aorta2.sort_ranges()
  >>> gr2
  index    |    Chromosome    Start    End      Name    Score    Strand
  int64    |    category      int64    int64    str     int64    category
  -------  ---  ------------  -------  -------  ------  -------  ----------
  1        |    chr1          10073    10272    Input   1        +
  5        |    chr1          10280    10479    Input   1        +
  6        |    chr1          16056    16255    Input   1        +
  7        |    chr1          16064    16263    Input   1        +
  ...      |    ...           ...      ...      ...     ...      ...
  4        |    chr1          10149    10348    Input   1        -
  3        |    chr1          10082    10281    Input   1        -
  2        |    chr1          10079    10278    Input   1        -
  0        |    chr1          9988     10187    Input   1        -
  PyRanges with 10 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

Various syntaxes are accepted, see its API. For example:

  >>> gr2.loci['-'] # get all rows with strand '-'
    index  |    Chromosome      Start      End  Name      Score  Strand
    int64  |    category        int64    int64  str       int64  category
  -------  ---  ------------  -------  -------  ------  -------  ----------
        9  |    chr1            19958    20157  Input         1  -
        4  |    chr1            10149    10348  Input         1  -
        3  |    chr1            10082    10281  Input         1  -
        2  |    chr1            10079    10278  Input         1  -
        0  |    chr1             9988    10187  Input         1  -
  PyRanges with 5 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 1 strands.

  >>> gr2.loci['chr1', '+'] # get all rows with chromosome 'chr1' and strand '+'
    index  |    Chromosome      Start      End  Name      Score  Strand
    int64  |    category        int64    int64  str       int64  category
  -------  ---  ------------  -------  -------  ------  -------  ----------
        1  |    chr1            10073    10272  Input         1  +
        5  |    chr1            10280    10479  Input         1  +
        6  |    chr1            16056    16255  Input         1  +
        7  |    chr1            16064    16263  Input         1  +
        8  |    chr1            16109    16308  Input         1  +
  PyRanges with 5 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 1 strands.

  >>> gr2.loci['chr1', 10000:11000] # get all rows on 'chr1' and overlapping 10000:11000
    index  |    Chromosome      Start      End  Name      Score  Strand
    int64  |    category        int64    int64  str       int64  category
  -------  ---  ------------  -------  -------  ------  -------  ----------
        1  |    chr1            10073    10272  Input         1  +
        5  |    chr1            10280    10479  Input         1  +
        4  |    chr1            10149    10348  Input         1  -
        3  |    chr1            10082    10281  Input         1  -
        2  |    chr1            10079    10278  Input         1  -
        0  |    chr1             9988    10187  Input         1  -
  PyRanges with 6 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.

  >>> gr2.loci['chr1', '+', 10000:11000] # get all rows on 'chr1', strand '+', and overlapping 10000:11000
    index  |    Chromosome      Start      End  Name      Score  Strand
    int64  |    category        int64    int64  str       int64  category
  -------  ---  ------------  -------  -------  ------  -------  ----------
        1  |    chr1            10073    10272  Input         1  +
        5  |    chr1            10280    10479  Input         1  +
  PyRanges with 2 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 1 strands.

  Loci also support assignment. In this case, you must provide a DataFrame with the same shape as the selection:

  >>> gr2.loci['chr1', '+', 10000:11000] = gr2.loci['chr1', '+', 10000:11000].copy().assign(Score=100)
  >>> gr2.loci['chr1', '+', 10000:11000]  # see below that the Score was altered
    index  |    Chromosome      Start      End  Name      Score  Strand
    int64  |    category        int64    int64  str       int64  category
  -------  ---  ------------  -------  -------  ------  -------  ----------
        1  |    chr1            10073    10272  Input       100  +
        5  |    chr1            10280    10479  Input       100  +
  PyRanges with 2 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 1 strands.

  For more flexible assignment, you can use ``loc`` and provide use the ``index`` attribute of ``loci`` output:

  >>> sindex=gr2.loci['chr1', '+', 16000:17000].index
  >>> gr2.loc[sindex, "Score"]=150
  >>> gr2
  index    |    Chromosome    Start    End      Name    Score    Strand
  int64    |    category      int64    int64    str     int64    category
  -------  ---  ------------  -------  -------  ------  -------  ----------
  1        |    chr1          10073    10272    Input   100      +
  5        |    chr1          10280    10479    Input   100      +
  6        |    chr1          16056    16255    Input   150      +
  7        |    chr1          16064    16263    Input   150      +
  ...      |    ...           ...      ...      ...     ...      ...
  4        |    chr1          10149    10348    Input   1        -
  3        |    chr1          10082    10281    Input   1        -
  2        |    chr1          10079    10278    Input   1        -
  0        |    chr1          9988     10187    Input   1        -
  PyRanges with 10 rows, 6 columns, and 1 index columns.
  Contains 1 chromosomes and 2 strands.


Sorting PyRanges
----------------

PyRanges objects can be sorted (i.e. altering the order of rows) by calling the pandas dataframe method ``sort_values``,
or the PyRanges method :func:`sort_ranges <pyranges1.PyRanges.sort_ranges>`.

  >>> import random; random.seed(1)
  >>> c = pr.example_data.chipseq.remove_nonloc_columns()
  >>> c['peak'] = [random.randint(0, 1000) for _ in range(len(c))] # add a column with random values
  >>> c
  index    |    Chromosome    Start      End        Strand      peak
  int64    |    category      int64      int64      category    int64
  -------  ---  ------------  ---------  ---------  ----------  -------
  0        |    chr8          28510032   28510057   -           137
  1        |    chr7          107153363  107153388  -           582
  2        |    chr5          135821802  135821827  -           867
  3        |    chr14         19418999   19419024   -           821
  ...      |    ...           ...        ...        ...         ...
  16       |    chr9          120803448  120803473  +           96
  17       |    chr6          89296757   89296782   -           499
  18       |    chr1          194245558  194245583  +           29
  19       |    chr8          57916061   57916086   +           914
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 15 chromosomes and 2 strands.


Pandas ``sort_values`` sorts the whole dataframe by the specified columns. See its API for details.
For example, let's sort by column ``peak``:

  >>> c.sort_values(by='peak', ascending=False)
  index    |    Chromosome    Start      End        Strand      peak
  int64    |    category      int64      int64      category    int64
  -------  ---  ------------  ---------  ---------  ----------  -------
  19       |    chr8          57916061   57916086   +           914
  2        |    chr5          135821802  135821827  -           867
  3        |    chr14         19418999   19419024   -           821
  14       |    chr2          152562484  152562509  -           807
  ...      |    ...           ...        ...        ...         ...
  7        |    chr19         19571102   19571127   +           120
  16       |    chr9          120803448  120803473  +           96
  5        |    chr21         40099618   40099643   +           64
  18       |    chr1          194245558  194245583  +           29
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 15 chromosomes and 2 strands.


PyRanges :func:`sort_ranges <pyranges1.PyRanges.sort_ranges>` is designed for genomic ranges.
By default, it sorts by Chromosome, Strand, then interval coordinates. If Strands are valid (
see :func:`strand_valid <pyranges1.PyRanges.strand_valid>`), then intervals on the reverse strand are
sorted in reverse order:

  >>> c.sort_ranges()
  index    |    Chromosome    Start      End        Strand      peak
  int64    |    category      int64      int64      category    int64
  -------  ---  ------------  ---------  ---------  ----------  -------
  12       |    chr1          38457520   38457545   +           667
  18       |    chr1          194245558  194245583  +           29
  13       |    chr1          80668132   80668157   -           388
  14       |    chr2          152562484  152562509  -           807
  ...      |    ...           ...        ...        ...         ...
  4        |    chr12         106679761  106679786  -           782
  3        |    chr14         19418999   19419024   -           821
  7        |    chr19         19571102   19571127   +           120
  5        |    chr21         40099618   40099643   +           64
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 15 chromosomes and 2 strands.


Above, ``chr8`` appears before ``chr10`` because of 'natural sorting'. We can force alphabetical sorting instead:

  >>> c.sort_ranges(natsort=False)
  index    |    Chromosome    Start      End        Strand      peak
  int64    |    category      int64      int64      category    int64
  -------  ---  ------------  ---------  ---------  ----------  -------
  12       |    chr1          38457520   38457545   +           667
  18       |    chr1          194245558  194245583  +           29
  13       |    chr1          80668132   80668157   -           388
  9        |    chr10         35419784   35419809   -           779
  ...      |    ...           ...        ...        ...         ...
  19       |    chr8          57916061   57916086   +           914
  0        |    chr8          28510032   28510057   -           137
  6        |    chr8          22714402   22714427   -           261
  16       |    chr9          120803448  120803473  +           96
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 15 chromosomes and 2 strands.
    
To sort by a different column, use the first argument (``by``). This is used after Chromosome and Strand, but before
coordinates:

  >>> c.sort_ranges('peak')
  index    |    Chromosome    Start      End        Strand      peak
  int64    |    category      int64      int64      category    int64
  -------  ---  ------------  ---------  ---------  ----------  -------
  18       |    chr1          194245558  194245583  +           29
  12       |    chr1          38457520   38457545   +           667
  13       |    chr1          80668132   80668157   -           388
  14       |    chr2          152562484  152562509  -           807
  ...      |    ...           ...        ...        ...         ...
  4        |    chr12         106679761  106679786  -           782
  3        |    chr14         19418999   19419024   -           821
  7        |    chr19         19571102   19571127   +           120
  5        |    chr21         40099618   40099643   +           64
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 15 chromosomes and 2 strands.

To use a different priorization of genomic location columns, specify them in the first argument (``by``):

  >>> c.sort_ranges(['peak', 'Chromosome', 'Strand'])
  index    |    Chromosome    Start      End        Strand      peak
  int64    |    category      int64      int64      category    int64
  -------  ---  ------------  ---------  ---------  ----------  -------
  18       |    chr1          194245558  194245583  +           29
  12       |    chr1          38457520   38457545   +           667
  13       |    chr1          80668132   80668157   -           388
  14       |    chr2          152562484  152562509  -           807
  ...      |    ...           ...        ...        ...         ...
  4        |    chr12         106679761  106679786  -           782
  3        |    chr14         19418999   19419024   -           821
  7        |    chr19         19571102   19571127   +           120
  5        |    chr21         40099618   40099643   +           64
  PyRanges with 20 rows, 5 columns, and 1 index columns.
  Contains 15 chromosomes and 2 strands.