RangeFrame

The RangeFrame is the parent class of PyRanges. It supports interval-based operations that do not require the data to contain Chromosome and Strand information. It is a subclass of pandas.DataFrame.

class pyranges1.range_frame.range_frame.RangeFrame(*args, **kwargs)

Class for range based operations.

A table with Start and End columns. Parent class of PyRanges. Subclass of pandas DataFrame.

cluster_overlaps(*, match_by: str | Iterable[str] | None = None, cluster_column: str = 'Cluster', slack: int = 0) RangeFrame

Give overlapping intervals a common id.

Parameters:
  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • slack (int, default 0) – Length by which the criteria of overlap are loosened. A value of 1 also clusters bookended intervals. Higher slack values cluster more distant intervals (with a maximum distance of slack-1 between them).

  • cluster_column (str, default "Cluster") – Name of the cluster column added to the output.

Returns:

RangeFrame with an ID column added, named by cluster_column (“Cluster” by default).

Return type:

RangeFrame

See also

RangeFrame.merge_overlaps

combine overlapping intervals into one
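
The effect of slack on clustering can be illustrated with a minimal pure-Python sketch. It is not the pyranges1 implementation: the helper name cluster_ids is hypothetical, intervals are assumed to be half-open (start, end) pairs, and the numbering of cluster ids (starting at 0 here) is an assumption.

```python
def cluster_ids(intervals, slack=0):
    """Give overlapping half-open (start, end) intervals a shared id."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i])
    ids = [0] * len(intervals)
    current_id, current_end = -1, None
    for i in order:
        start, end = intervals[i]
        # Open a new cluster when the gap to the running cluster is >= slack.
        if current_end is None or start >= current_end + slack:
            current_id += 1
            current_end = end
        else:
            current_end = max(current_end, end)
        ids[i] = current_id
    return ids


intervals = [(1, 5), (4, 8), (8, 10), (12, 15)]
print(cluster_ids(intervals))           # [0, 0, 1, 2]
print(cluster_ids(intervals, slack=1))  # [0, 0, 0, 1]  bookended (4, 8) and (8, 10) join
print(cluster_ids(intervals, slack=3))  # [0, 0, 0, 0]  a gap of slack-1 = 2 is bridged
```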

combine_interval_columns(function: Literal['intersect', 'union', 'swap'] | CombineIntervalColumnsOperation = 'intersect', *, start: str = 'Start', end: str = 'End', start2: str = 'Start_b', end2: str = 'End_b', drop_old_columns: bool = True) RangeFrame

Use two pairs of columns representing intervals to create a new start and end column.

The function is designed as post-processing after join_overlaps to aggregate the coordinates of the two intervals. By default, the new start and end columns will be the intersection of the intervals.

Parameters:
  • function ({"intersect", "union", "swap"} or Callable, default "intersect") – How to combine the self and other intervals: “intersect”, “union”, or “swap”. If a callable is passed, it should take four Series arguments (start1, end1, start2, end2) and return a tuple of two Series of integers: (new_starts, new_ends).

  • start (str, default "Start") – Column name for Start of first interval

  • end (str, default "End") – Column name for End of first interval

  • start2 (str, default "Start_b") – Column name for Start of second interval

  • end2 (str, default "End_b") – Column name for End of second interval

  • drop_old_columns (bool, default True) – Whether to drop the above mentioned columns.
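
As a rough illustration of what “intersect” and “union” mean for two pairs of coordinate columns, the sketch below uses plain lists as stand-ins for the Series that a callable passed as function would receive. The function names and sample values are hypothetical, and “swap” is not modelled.

```python
def intersect(start1, end1, start2, end2):
    # Intersection: the larger of the two starts, the smaller of the two ends.
    new_starts = [max(a, b) for a, b in zip(start1, start2)]
    new_ends = [min(a, b) for a, b in zip(end1, end2)]
    return new_starts, new_ends


def union(start1, end1, start2, end2):
    # Union: the smaller of the two starts, the larger of the two ends.
    new_starts = [min(a, b) for a, b in zip(start1, start2)]
    new_ends = [max(a, b) for a, b in zip(end1, end2)]
    return new_starts, new_ends


starts, ends = [1, 10], [5, 20]        # first interval of each pair
starts_b, ends_b = [3, 15], [8, 30]    # second interval of each pair
print(intersect(starts, ends, starts_b, ends_b))  # ([3, 15], [5, 20])
print(union(starts, ends, starts_b, ends_b))      # ([1, 10], [8, 30])
```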

copy(*args, **kwargs) RangeFrame

Make a copy of this object’s indices and data.

When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).

Note

The deep=False behaviour as described above will change in pandas 3.0. Copy-on-Write will be enabled by default, which means that the “shallow” copy that is returned with deep=False will still avoid making an eager copy, but changes to the data of the original will no longer be reflected in the shallow copy (or vice versa). Instead, it makes use of a lazy (deferred) copy mechanism that will copy the data only when any changes to the original or shallow copy are made.

You can already get the future behavior and improvements by enabling Copy-on-Write: pd.options.mode.copy_on_write = True

Parameters:

deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.

Returns:

Object type matches caller.

Return type:

Series or DataFrame

Notes

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.

Since pandas is not thread safe, see the gotchas when copying in a threading environment.

When copy_on_write in pandas config is set to True, the copy_on_write config takes effect even when deep=False. This means that any changes to the copied data would make a new copy of the data upon write (and vice versa). Changes made to either the original or copied variable would not be reflected in the counterpart. See Copy_on_Write for more information.

Examples

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64
>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64

Shallow copy versus default (deep) copy:

>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)

Shallow copy shares data and index with original.

>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True

Deep copy has own copy of data and index.

>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False

Updates to the data shared by the shallow copy and the original are reflected in both (NOTE: this will no longer be true for pandas >= 3.0); the deep copy remains unchanged.

>>> s.iloc[0] = 3
>>> shallow.iloc[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64

Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy.

>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object

When Copy-on-Write is enabled, the shallow copy is not modified when the original data is changed:

>>> with pd.option_context("mode.copy_on_write", True):
...     s = pd.Series([1, 2], index=["a", "b"])
...     copy = s.copy(deep=False)
...     s.iloc[0] = 100
...     s
a    100
b      2
dtype: int64
>>> copy
a    1
b    2
dtype: int64

count_overlaps(other: RangeFrame, *, match_by: str | list[str] | None = None, slack: int = 0) Series

Count the number of overlaps per interval.

For each interval in self, count how many intervals in other overlap with it. The overlap computation is based on the start and end coordinates, with an optional slack parameter to adjust the overlap threshold by temporarily extending the intervals.

Parameters:
  • other (RangeFrame) – The RangeFrame whose intervals are compared against those in self for overlap counting.

  • match_by (str or list, default None) – Column(s) to group intervals by when determining overlaps. Only intervals with equal values in the specified column(s) will be considered as overlapping.

  • slack (int, default 0) – Temporarily extend intervals in self by this many units on both ends before checking for overlaps, thereby adjusting the overlap threshold.

Returns:

A pandas Series where each element corresponds to the number of overlapping intervals in other for the corresponding interval in self.

Return type:

pd.Series
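
The counting rule can be sketched in pure Python as follows. This is only an illustration under stated assumptions, not the library's algorithm: intervals are half-open (start, end) tuples, slack extends each interval of self on both ends, and the helper name count_overlaps_sketch is hypothetical.

```python
def count_overlaps_sketch(self_intervals, other_intervals, slack=0):
    """For each interval in self, count the intervals in other that overlap it."""
    counts = []
    for start, end in self_intervals:
        lo, hi = start - slack, end + slack  # temporary extension by slack
        counts.append(
            sum(1 for o_start, o_end in other_intervals if o_start < hi and lo < o_end)
        )
    return counts


a = [(0, 10), (20, 30)]
b = [(5, 8), (9, 12), (30, 40)]
print(count_overlaps_sketch(a, b))           # [2, 0]
print(count_overlaps_sketch(a, b, slack=1))  # [2, 1]  (30, 40) is bookended with (20, 30)
```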

drop(*args, **kwargs) RangeFrame | None

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.

Parameters:
  • labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

  • index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

  • columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

  • level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.

  • inplace (bool, default False) – If False, return a copy. Otherwise, do operation in place and return None.

  • errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.

Returns:

DataFrame with the specified index or column labels removed, or None if inplace=True.

Return type:

DataFrame or None

Raises:

KeyError – If any of the labels is not found in the selected axis.

See also

DataFrame.loc

Label-location based indexer for selection by label.

DataFrame.dropna

Return DataFrame with labels on given axis omitted where (all or any) data are missing.

DataFrame.drop_duplicates

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Series.drop

Return Series with specified index labels removed.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop rows by index

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['llama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                big     small
llama   speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

Drop a specific index combination from the MultiIndex DataFrame, i.e., drop the combination 'falcon' and 'weight', which deletes only the corresponding row

>>> df.drop(index=('falcon', 'weight'))
                big     small
llama   speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        length  0.3     0.2
>>> df.drop(index='cow', columns='small')
                big
llama   speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3
>>> df.drop(index='length', level=1)
                big     small
llama   speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8

join_overlaps(other: RangeFrame, *, join_type: Literal['inner', 'left', 'outer', 'right'] = 'inner', multiple: Literal['first', 'all', 'last', 'contained'] = 'all', match_by: str | Iterable[str] | None = None, slack: int = 0, suffix: str = '_b', contained_intervals_only: bool = False, report_overlap_column: str | None = None, preserve_input_order: bool = True) RangeFrame

Join RangeFrame objects based on overlapping intervals.

Find pairs of overlapping intervals between self and other and combine their attributes. Each row in the output contains columns from both intervals, including their start and end positions. By default, only overlapping intervals are included, but the join_type parameter controls how intervals without overlaps are handled.

Parameters:
  • other (RangeFrame) – The RangeFrame to join with.

  • join_type ({"inner", "left", "right", "outer"}, default "inner") – Specifies how to handle intervals that do not overlap. “inner” returns only overlapping intervals, “left” returns all intervals from self (with missing values for non-overlapping intervals from other), “right” returns all intervals from other, and “outer” returns all intervals from both.

  • multiple ({"all", "first", "last"}, default "all") – Determines which overlapping interval(s) to report when multiple intervals in other overlap the same interval in self. “all” reports all overlaps (which may lead to duplicate rows), “first” reports only the overlapping interval with the smallest start in other, and “last” reports only the overlapping interval with the largest end in other.

  • match_by (str or list, default None) – If provided, only intervals with matching values in the specified column(s) will be joined.

  • slack (int, default 0) – Temporarily extend intervals in self by this many units on both ends before checking for overlaps.

  • suffix (str, default "_b") – Suffix to append to columns from the other RangeFrame in the output.

  • contained_intervals_only (bool, default False) – If True, only join intervals from self that are entirely contained within an interval from other.

  • report_overlap_column (str or None, default None) – If provided, add a column with this name reporting the amount of overlap between joined intervals. The overlap is computed as the minimum of the end positions minus the maximum of the start positions.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A new RangeFrame containing the joined intervals with columns from both input RangeFrames. The indices of the input RangeFrames are not preserved in the output.

Return type:

RangeFrame

Notes

Attributes from the other RangeFrame may have their column names modified by appending the specified suffix.
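
To make the pairing and the overlap amount concrete, here is a minimal pure-Python sketch of the “inner” and “left” cases. It is a simplified model rather than the pyranges1 implementation: intervals are assumed to be half-open (start, end) tuples, the helper name join_overlaps_sketch is hypothetical, and slack, match_by, and multiple are not modelled.

```python
def join_overlaps_sketch(left, right, join_type="inner"):
    """Pair overlapping intervals; the last field is min(ends) - max(starts)."""
    if join_type not in ("inner", "left"):
        raise ValueError("this sketch only models 'inner' and 'left'")
    rows = []
    for l_start, l_end in left:
        found = False
        for r_start, r_end in right:
            if r_start < l_end and l_start < r_end:
                overlap = min(l_end, r_end) - max(l_start, r_start)
                rows.append((l_start, l_end, r_start, r_end, overlap))
                found = True
        # A left join keeps unmatched intervals of self with missing values.
        if not found and join_type == "left":
            rows.append((l_start, l_end, None, None, None))
    return rows


left = [(0, 10), (20, 30)]
right = [(5, 15), (8, 9)]
print(join_overlaps_sketch(left, right))
# [(0, 10, 5, 15, 5), (0, 10, 8, 9, 1)]
print(join_overlaps_sketch(left, right, join_type="left"))
# [(0, 10, 5, 15, 5), (0, 10, 8, 9, 1), (20, 30, None, None, None)]
```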

max_disjoint_overlaps(*, slack: int = 0, match_by: str | Iterable[str] | None = None, preserve_input_order: bool = True) RangeFrame

Find the maximal disjoint set of intervals.

Returns a subset of the rows in self so that no two intervals overlap, choosing those that maximize the number of intervals in the result.

Parameters:
  • slack (int, default 0) – Length by which the criteria of overlap are loosened. A value of 1 implies that bookended intervals are considered overlapping. Higher slack values allow more distant intervals (with a maximum distance of slack-1 between them).

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

RangeFrame with maximal disjoint set of intervals.

Return type:

RangeFrame

See also

RangeFrame.merge_overlaps

merge intervals into non-overlapping superintervals

RangeFrame.cluster_overlaps

annotate overlapping intervals with common ID
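
One well-known way to obtain a largest set of mutually non-overlapping intervals is the greedy "earliest end first" strategy from interval scheduling. The sketch below uses that technique for illustration; whether pyranges1 uses the same algorithm, or picks the same intervals when several maximal sets exist, is not stated in this reference. The helper name and the half-open (start, end) convention are assumptions.

```python
def max_disjoint_sketch(intervals, slack=0):
    """Greedy interval scheduling: repeatedly keep the interval that ends first."""
    chosen = []
    last_end = None
    for i in sorted(range(len(intervals)), key=lambda i: intervals[i][1]):
        start, end = intervals[i]
        # Keep the interval only if it lies at least `slack` past the last kept end.
        if last_end is None or start >= last_end + slack:
            chosen.append(i)
            last_end = end
    return [intervals[i] for i in sorted(chosen)]  # restore input order


intervals = [(1, 10), (2, 4), (4, 6), (5, 9), (9, 12)]
print(max_disjoint_sketch(intervals))           # [(2, 4), (4, 6), (9, 12)]
print(max_disjoint_sketch(intervals, slack=1))  # [(2, 4), (5, 9)]
```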

merge_overlaps(*, count_col: str | None = None, match_by: str | Iterable[str] | None = None, slack: int = 0) RangeFrame

Merge overlapping intervals into one.

Merge overlapping intervals into a single superinterval by uniting intervals that overlap, optionally allowing a small gap (specified by slack) between intervals to be merged. The resulting RangeFrame will contain the merged intervals, and if count_col is provided, a column with the counts of merged intervals will be included.

Parameters:
  • count_col (str or None, default None) – Name of the column to store the count of intervals merged into each superinterval. If None, no count column is added.

  • match_by (str or list, default None) – Column(s) to group intervals by before merging. Only intervals with equal values in the specified column(s) will be considered as overlapping.

  • slack (int, default 0) – Allow this many units between intervals to still consider them overlapping.

Returns:

A RangeFrame with merged (super) intervals. Metadata columns, index, and order are not necessarily preserved.

Return type:

RangeFrame
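
A minimal pure-Python sketch of merging with a count is shown below. It assumes half-open (start, end) intervals and the slack convention described for cluster_overlaps (slack=1 merges bookended intervals); whether merge_overlaps measures gaps in exactly this way is an assumption, and the helper name is hypothetical.

```python
def merge_overlaps_sketch(intervals, slack=0):
    """Merge overlapping intervals into (start, end, count) superintervals."""
    merged = []  # each entry is [start, end, number_of_merged_intervals]
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1] + slack:
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] += 1
        else:
            merged.append([start, end, 1])
    return [tuple(m) for m in merged]


intervals = [(1, 5), (4, 8), (8, 10), (12, 15)]
print(merge_overlaps_sketch(intervals))           # [(1, 8, 2), (8, 10, 1), (12, 15, 1)]
print(merge_overlaps_sketch(intervals, slack=1))  # [(1, 10, 3), (12, 15, 1)]
```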

nearest_ranges(other: RangeFrame, *, match_by: str | Iterable[str] | None = None, suffix: str = '_b', exclude_overlaps: bool = False, k: int = 1, dist_col: str | None = 'Distance', direction: Literal['any', 'forward', 'backward'] = 'any', preserve_input_order: bool = True) RangeFrame

Find closest interval.

For each interval in self RangeFrame, the columns of the nearest interval in other RangeFrame are appended.

Parameters:
  • other (RangeFrame) – RangeFrame to find nearest interval in.

  • exclude_overlaps (bool, default False) – If True, intervals in other that overlap with an interval in self are not reported as its nearest ones.

  • direction ({"any", "forward", "backward"}, default "any", i.e. both directions) – Whether to only look for nearest in one direction.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be matched.

  • k (int, default 1) – Number of nearest intervals to fetch.

  • suffix (str, default "_b") – Suffix to give columns with shared name in other.

  • dist_col (str or None, default "Distance") – Name of the column in which to store the distance. If None, no distance column is added.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A RangeFrame with the columns of the nearest interval appended horizontally.

Return type:

RangeFrame

See also

RangeFrame.join_overlaps

Has a slack argument to find intervals within a distance.
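
The role of exclude_overlaps can be sketched for k=1 as below. This is an illustration only: the helper name is hypothetical, intervals are assumed to be half-open (start, end) tuples, and the distance used here (0 for overlapping intervals, otherwise the size of the gap) is an assumption that may differ from the convention of the library's Distance column.

```python
def nearest_sketch(self_intervals, other_intervals, exclude_overlaps=False):
    """For each interval in self, return (distance, nearest interval in other)."""
    result = []
    for s_start, s_end in self_intervals:
        candidates = []
        for o_start, o_end in other_intervals:
            overlapping = max(s_start, o_start) < min(s_end, o_end)
            if overlapping and exclude_overlaps:
                continue
            # Assumed distance: 0 when overlapping, else the gap between the two.
            distance = 0 if overlapping else max(s_start, o_start) - min(s_end, o_end)
            candidates.append((distance, (o_start, o_end)))
        result.append(min(candidates) if candidates else None)
    return result


a = [(10, 20)]
b = [(0, 5), (15, 18), (23, 30)]
print(nearest_sketch(a, b))                         # [(0, (15, 18))]
print(nearest_sketch(a, b, exclude_overlaps=True))  # [(3, (23, 30))]
```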

overlap(other: RangeFrame, multiple: Literal['first', 'all', 'last', 'contained'] = 'all', slack: int = 0, *, contained_intervals_only: bool = False, match_by: str | Iterable[str] | None = None, preserve_input_order: bool = True) RangeFrame

Return overlapping intervals.

Returns the intervals in self which overlap with those in other.

Parameters:
  • other (RangeFrame) – RangeFrame to find overlaps with.

  • multiple ({"all", "first", "last"}, default "all") – Which intervals to report when multiple intervals in ‘other’ overlap with the same interval in self. The default “all” reports all overlapping subintervals, which will have duplicate indices. “first” reports, for each interval in self, only the overlapping subinterval with the smallest Start in ‘other’. “last” reports only the overlapping subinterval with the largest End in ‘other’.

  • slack (int, default 0) – Intervals in self are temporarily extended by slack on both ends before overlap is calculated, so that non-overlapping intervals count as overlapping if they are less than slack apart. For example, slack=1 also reports bookended intervals.

  • contained_intervals_only (bool, default False) – Whether to report only intervals that are entirely contained in an interval of ‘other’.

  • match_by (str or list, default None) – If provided, only overlapping intervals with an equal value in column(s) match_by are reported.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

A RangeFrame with overlapping intervals.

Return type:

RangeFrame

See also

RangeFrame.intersect

report overlapping subintervals

RangeFrame.set_intersect

set-intersect RangeFrame (e.g. merge then intersect)
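
The difference between plain overlap, overlap with slack, and contained_intervals_only can be sketched in pure Python. The helper below is hypothetical and simplified: it assumes half-open (start, end) tuples, reports each interval of self at most once, and does not model the multiple or match_by parameters.

```python
def overlap_sketch(self_intervals, other_intervals, slack=0,
                   contained_intervals_only=False):
    """Return the intervals of self that overlap at least one interval in other."""
    kept = []
    for start, end in self_intervals:
        for o_start, o_end in other_intervals:
            if contained_intervals_only:
                hit = o_start <= start and end <= o_end
            else:
                hit = o_start < end + slack and start - slack < o_end
            if hit:
                kept.append((start, end))
                break  # one hit is enough in this simplified model
    return kept


a = [(0, 10), (20, 30), (40, 50)]
b = [(5, 32), (50, 60)]
print(overlap_sketch(a, b))                                 # [(0, 10), (20, 30)]
print(overlap_sketch(a, b, slack=1))                        # [(0, 10), (20, 30), (40, 50)]
print(overlap_sketch(a, b, contained_intervals_only=True))  # [(20, 30)]
```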

reindex(*args, **kwargs) RangeFrame

Conform DataFrame to new index with optional filling logic.

Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters:
  • labels (array-like, optional) – New labels / index to conform the axis specified by ‘axis’ to.

  • index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.

  • columns (array-like, optional) – New labels for the columns. Preferably an Index object to avoid duplicating data.

  • axis (int or str, optional) – Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).

  • method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –

    Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

    • None (default): don’t fill gaps

    • pad / ffill: Propagate last valid observation forward to next valid.

    • backfill / bfill: Use next valid observation to fill gap.

    • nearest: Use nearest valid observations to fill gap.

  • copy (bool, default True) –

    Return a new object, even if the passed indexes are the same.

    Note

    The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.

    You can already get the future behavior and improvements through enabling copy on write pd.options.mode.copy_on_write = True

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (scalar, default np.nan) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.

  • tolerance (optional) –

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.

    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

Return type:

DataFrame with changed index.

See also

DataFrame.set_index

Set row labels.

DataFrame.reset_index

Remove row labels or move them to new columns.

DataFrame.reindex_like

Change to same indices as other DataFrame.

Examples

DataFrame.reindex supports two calling conventions

  • (index=index_labels, columns=column_labels, ...)

  • (labels, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Create a dataframe with some fictional data.

>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00

Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02

We can fill in the missing values by passing a value to the keyword fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.

>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02
>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02

We can also reindex the columns.

>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

Or we can use “axis-style” keyword arguments

>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).

>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0

Suppose we decide to expand the dataframe to cover a wider date range.

>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.

For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.

>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29   100.0
2009-12-30   100.0
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.

See the user guide for more.

sort_by_position() RangeFrame

Sort by Start and End columns.

sort_ranges(by: str | Iterable[str] | None = None, *, natsort: bool = True, sort_rows_reverse_order: Sequence[bool] | None = None) RangeFrame

Sort RangeFrame according to Start, End, and any other columns given.

For uses not covered by this function, use DataFrame.sort_values().

Parameters:
  • by (str or list of str, default None) – Column(s) to sort by in addition to Start and End. To use a different priority, list the columns in the desired order as part of the ‘by’ argument.

  • natsort (bool, default True) – Whether to use natural sorting for the columns in by.

  • sort_rows_reverse_order (sequence of bools or None) – Whether to sort these rows in the reverse order for the starts and ends.

Returns:

Sorted RangeFrame. The index is preserved. Use .reset_index(drop=True) to reset the index.

Return type:

RangeFrame
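
Natural sorting treats digit runs inside labels as numbers, so that “sample2” sorts before “sample10”. The sketch below shows the idea with the standard library only; the label values, the natural_key helper, and the choice to sort by the label before Start and End are assumptions made for illustration, not a description of how sort_ranges is implemented.

```python
import re


def natural_key(text):
    """Split a label into text and integer chunks so digit runs compare numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", text)]


# Rows are (label, start, end); the label column is hypothetical.
rows = [("sample10", 5, 9), ("sample2", 7, 8), ("sample2", 1, 4)]

plain = sorted(rows)
natural = sorted(rows, key=lambda r: (natural_key(r[0]), r[1], r[2]))

print(plain)    # [('sample10', 5, 9), ('sample2', 1, 4), ('sample2', 7, 8)]
print(natural)  # [('sample2', 1, 4), ('sample2', 7, 8), ('sample10', 5, 9)]
```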

subtract_overlaps(other: RangeFrame, match_by: str | Iterable[str] | None = None, *, preserve_input_order: bool = True) RangeFrame

Subtract intervals, i.e. return non-overlapping subintervals.

Identify intervals in other that overlap with intervals in self; return self with the overlapping parts removed.

Parameters:
  • other (RangeFrame) – RangeFrame to subtract.

  • match_by (str or list, default None) – If provided, only intervals with an equal value in column(s) match_by may be considered as overlapping.

  • preserve_input_order (bool, default True) –

    Whether to preserve the original input order in the result.

    If False, rows may be returned in algorithm/output order instead, which can be faster for large results.

Returns:

RangeFrame with subintervals from self that do not overlap with any interval in other. Columns and index are preserved.

Return type:

RangeFrame

Warning

The returned RangeFrame may have duplicate index values. Call .reset_index(drop=True) to reset the index.

See also

RangeFrame.overlap

use with invert=True to return all intervals without overlap

RangeFrame.complement_ranges

return the internal complement_ranges of intervals, i.e. its introns.
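
Interval subtraction can be sketched in pure Python as below. It is not the pyranges1 implementation: the helper name is hypothetical and intervals are assumed to be half-open (start, end) tuples. The example also shows why the result can contain index duplicates: one interval of self may be split into several subintervals.

```python
def subtract_sketch(self_intervals, other_intervals):
    """Return the parts of each interval in self not covered by any interval in other."""
    result = []
    others = sorted(other_intervals)
    for start, end in self_intervals:
        cursor = start  # left edge of the part not yet accounted for
        for o_start, o_end in others:
            if o_end <= cursor or o_start >= end:
                continue  # no overlap with the remaining part
            if o_start > cursor:
                result.append((cursor, o_start))  # uncovered piece before the overlap
            cursor = max(cursor, o_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


a = [(0, 10), (20, 30)]
b = [(2, 4), (8, 12), (18, 35)]
print(subtract_sketch(a, b))  # [(0, 2), (4, 8)]  (0, 10) is split, (20, 30) is fully removed
```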