Stats: statistics
Utilities for calculating statistics about genomic intervals.
Statistics useful for genomics.
- class pyranges1.ext.stats.Any(*args, **kwargs)
Special type indicating an unconstrained type.
Any is compatible with every type.
Any assumed to have all methods.
All values assumed to be instances of Any.
Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance checks.
- class pyranges1.ext.stats.DataFrame(data=None, index: Axes | None = None, columns: Axes | None = None, dtype: Dtype | None = None, copy: bool | None = None)
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
- Parameters:
data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs.
If data is a list of dicts, column order follows insertion-order.
index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns (Index or array-like) – Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer.
copy (bool or None, default None) –
Copy data from inputs. For dict data, the default of None behaves like
copy=True. For DataFrame or 2d ndarray input, the default of None behaves likecopy=False. If data is a dict containing one or more Series (possibly of different dtypes),copy=Falsewill ensure that these inputs are not copied.Changed in version 1.3.0.
See also
DataFrame.from_recordsConstructor from tuples, also record arrays.
DataFrame.from_dictFrom dicts of Series, arrays, or dicts.
read_csvRead a comma-separated values (csv) file into DataFrame.
read_tableRead general delimited file into DataFrame.
read_clipboardRead text from clipboard into DataFrame.
Notes
Please reference the User Guide for more information.
Examples
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df = pd.DataFrame(data=d) >>> df col1 col2 0 1 3 1 2 4
Notice that the inferred dtype is int64.
>>> df.dtypes col1 int64 col2 int64 dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8) >>> df.dtypes col1 int8 col2 int8 dtype: object
Constructing DataFrame from a dictionary including Series:
>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])} >>> pd.DataFrame(data=d, index=[0, 1, 2, 3]) col1 col2 0 0 NaN 1 1 NaN 2 2 2.0 3 3 3.0
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), ... columns=['a', 'b', 'c']) >>> df2 a b c 0 1 2 3 1 4 5 6 2 7 8 9
Constructing DataFrame from a numpy ndarray that has labeled columns:
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")]) >>> df3 = pd.DataFrame(data, columns=['c', 'a']) ... >>> df3 c a 0 3 1 1 6 4 2 9 7
Constructing DataFrame from dataclass:
>>> from dataclasses import make_dataclass >>> Point = make_dataclass("Point", [("x", int), ("y", int)]) >>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)]) x y 0 0 0 1 0 3 2 2 3
Constructing DataFrame from Series/DataFrame:
>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"]) >>> df = pd.DataFrame(data=ser, index=["a", "c"]) >>> df 0 a 1 c 3
>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"]) >>> df2 = pd.DataFrame(data=df1, index=["a", "c"]) >>> df2 x a 1 c 3
- property T: DataFrame
The transpose of the DataFrame.
- Returns:
The transposed DataFrame.
- Return type:
See also
DataFrame.transposeTranspose index and columns.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df col1 col2 0 1 3 1 2 4
>>> df.T 0 1 col1 1 2 col2 3 4
- add(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to
dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- agg(func=None, axis: Axis = 0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g.
[np.sum, 'mean']dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
- Return type:
See also
DataFrame.applyPerform any type of operations.
DataFrame.transformPerform transformation type operations.
pandas.DataFrame.groupbyPerform operations over groups.
pandas.DataFrame.resamplePerform operations over resampled bins.
pandas.DataFrame.rollingPerform operations over rolling window.
pandas.DataFrame.expandingPerform operations over expanding window.
pandas.core.window.ewm.ExponentialMovingWindowPerform operation over exponential weighted window.
Notes
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g.,
numpy.mean(arr_2d)as opposed tonumpy.mean(arr_2d, axis=0).agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3], ... [4, 5, 6], ... [7, 8, 9], ... [np.nan, np.nan, np.nan]], ... columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min']) A B C sum 12.0 15.0 18.0 min 1.0 2.0 3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}) A B sum 12.0 NaN min 1.0 2.0 max NaN 8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean')) A B C x 7.0 NaN NaN y NaN 2.0 NaN z NaN NaN 6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns") 0 2.0 1 5.0 2 8.0 3 NaN dtype: float64
- aggregate(func=None, axis: Axis = 0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g.
[np.sum, 'mean']dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
- Return type:
See also
DataFrame.applyPerform any type of operations.
DataFrame.transformPerform transformation type operations.
pandas.DataFrame.groupbyPerform operations over groups.
pandas.DataFrame.resamplePerform operations over resampled bins.
pandas.DataFrame.rollingPerform operations over rolling window.
pandas.DataFrame.expandingPerform operations over expanding window.
pandas.core.window.ewm.ExponentialMovingWindowPerform operation over exponential weighted window.
Notes
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g.,
numpy.mean(arr_2d)as opposed tonumpy.mean(arr_2d, axis=0).agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3], ... [4, 5, 6], ... [7, 8, 9], ... [np.nan, np.nan, np.nan]], ... columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min']) A B C sum 12.0 15.0 18.0 min 1.0 2.0 3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']}) A B sum 12.0 NaN min 1.0 2.0 max NaN 8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', 'max'), y=('B', 'min'), z=('C', 'mean')) A B C x 7.0 NaN NaN y NaN 2.0 NaN z NaN NaN 6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns") 0 2.0 1 5.0 2 8.0 3 NaN dtype: float64
- all(axis: Axis | None = 0, bool_only: bool = False, skipna: bool = True, **kwargs) Series | bool
Return whether all elements are True, potentially over an axis.
Returns True unless there at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default False) – Include only boolean columns. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, then, DataFrame is returned; otherwise, Series is returned.
- Return type:
See also
Series.allReturn True if all elements are True.
DataFrame.anyReturn True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all() True >>> pd.Series([True, False]).all() False >>> pd.Series([], dtype="float64").all() True >>> pd.Series([np.nan]).all() True >>> pd.Series([np.nan]).all(skipna=False) True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) >>> df col1 col2 0 True True 1 True False
Default behaviour checks if values in each column all return True.
>>> df.all() col1 True col2 False dtype: bool
Specify
axis='columns'to check if values in each row all return True.>>> df.all(axis='columns') 0 True 1 False dtype: bool
Or
axis=Nonefor whether every value is True.>>> df.all(axis=None) False
- any(*, axis: Axis | None = 0, bool_only: bool = False, skipna: bool = True, **kwargs) Series | bool
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default False) – Include only boolean columns. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, then, DataFrame is returned; otherwise, Series is returned.
- Return type:
See also
numpy.anyNumpy version of this method.
Series.anyReturn whether any element is True.
Series.allReturn whether all elements are True.
DataFrame.anyReturn whether any element is True over requested axis.
DataFrame.allReturn whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any() False >>> pd.Series([True, False]).any() True >>> pd.Series([], dtype="float64").any() False >>> pd.Series([np.nan]).any() False >>> pd.Series([np.nan]).any(skipna=False) True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) >>> df A B C 0 1 0 0 1 2 2 0
>>> df.any() A True B True C False dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) >>> df A B 0 True 1 1 False 2
>>> df.any(axis='columns') 0 True 1 True dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) >>> df A B 0 True 1 1 False 0
>>> df.any(axis='columns') 0 True 1 False dtype: bool
Aggregating over the entire DataFrame with
axis=None.>>> df.any(axis=None) True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any() Series([], dtype: bool)
- apply(func: AggFuncType, axis: Axis = 0, raw: bool = False, result_type: Literal['expand', 'reduce', 'broadcast'] | None = None, args=(), by_row: Literal[False, 'compat'] = 'compat', engine: Literal['python', 'numba'] = 'python', engine_kwargs: dict[str, bool] | None = None, **kwargs)
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (
axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.- Parameters:
func (function) – Function to apply to each column or row.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
raw (bool, default False) –
Determines if row or column is passed as a Series or ndarray object:
False: passes each row or column as a Series to the function.True: the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
result_type ({'expand', 'reduce', 'broadcast', None}, default None) –
These only act when
axis=1(columns):’expand’ : list-like results will be turned into columns.
’reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
’broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
args (tuple) – Positional arguments to pass to func in addition to the array/series.
by_row (False or "compat", default "compat") –
Only has an effect when
funcis a listlike or dictlike of funcs and the func isn’t a string. If “compat”, will if possible first translate the func into pandas methods (e.g.Series().apply(np.sum)will be translated toSeries().sum()). If that doesn’t work, will try call to apply again withby_row=Trueand if that fails, will call apply again withby_row=False(backward compatible). If False, the funcs will be passed the whole Series at once.Added in version 2.1.0.
engine ({'python', 'numba'}, default 'python') –
Choose between the python (default) engine or the numba engine in apply.
The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs :
nopython (compile the function in nopython mode)
nogil (release the GIL inside the JIT compiled function)
parallel (try to apply the function in parallel over the DataFrame)
Note: Due to limitations within numba/how pandas interfaces with numba, you should only use this if raw=True
Note: The numba compiler only supports a subset of valid Python/numpy operations.
Please read more about the supported python features and supported numpy features in numba to learn what you can or cannot use in the passed function.
Added in version 2.2.0.
engine_kwargs (dict) – Pass keyword arguments to the engine. This is currently only used by the numba engine, see the documentation for the engine argument for more information.
**kwargs – Additional keyword arguments to pass as keywords arguments to func.
- Returns:
Result of applying
funcalong the given axis of the DataFrame.- Return type:
See also
DataFrame.mapFor elementwise operations.
DataFrame.aggregateOnly perform aggregating type operations.
DataFrame.transformOnly perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B']) >>> df A B 0 4 9 1 4 9 2 4 9
Using a numpy universal function (in this case the same as
np.sqrt(df)):>>> df.apply(np.sqrt) A B 0 2.0 3.0 1 2.0 3.0 2 2.0 3.0
Using a reducing function on either axis
>>> df.apply(np.sum, axis=0) A 12 B 27 dtype: int64
>>> df.apply(np.sum, axis=1) 0 13 1 13 2 13 dtype: int64
Returning a list-like will result in a Series
>>> df.apply(lambda x: [1, 2], axis=1) 0 [1, 2] 1 [1, 2] 2 [1, 2] dtype: object
Passing
result_type='expand'will expand list-like results to columns of a Dataframe>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand') 0 1 0 1 2 1 1 2 2 1 2
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series index.>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1) foo bar 0 1 2 1 1 2 2 1 2
Passing
result_type='broadcast'will ensure the same shape result, whether list-like or scalar is returned by the function, and broadcast it along the axis. The resulting column names will be the originals.>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast') A B 0 1 2 1 1 2 2 1 2
- applymap(func: PythonFuncType, na_action: NaAction | None = None, **kwargs) DataFrame
Apply a function to a Dataframe elementwise.
Deprecated since version 2.1.0: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
- Parameters:
func (callable) – Python function, returns a single value from a single value.
na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to func.
**kwargs – Additional keyword arguments to pass as keywords arguments to func.
- Returns:
Transformed DataFrame.
- Return type:
See also
DataFrame.applyApply a function along input axis of DataFrame.
DataFrame.mapApply a function along input axis of DataFrame.
DataFrame.replaceReplace values given in to_replace with value.
Examples
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]]) >>> df 0 1 0 1.000 2.120 1 3.356 4.567
>>> df.map(lambda x: len(str(x))) 0 1 0 3 4 1 5 5
- assign(**kwargs) DataFrame
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
- Parameters:
**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
- Returns:
A new DataFrame with the new columns in addition to all the existing columns.
- Return type:
Notes
Assigning multiple columns within the same
assignis possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.Examples
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]}, ... index=['Portland', 'Berkeley']) >>> df temp_c Portland 17.0 Berkeley 25.0
Where the value is a callable, evaluated on df:
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0
Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:
>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32) temp_c temp_f Portland 17.0 62.6 Berkeley 25.0 77.0
You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32, ... temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9) temp_c temp_f temp_k Portland 17.0 62.6 290.15 Berkeley 25.0 77.0 298.15
- property axes: list[Index]
Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df.axes [RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'], dtype='object')]
- boxplot(column=None, by=None, ax=None, fontsize: int | None = None, rot: int = 0, grid: bool = True, figsize: tuple[float, float] | None = None, layout=None, return_type=None, backend=None, **kwargs)
Make a box plot from DataFrame columns.
Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns. A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of box to show the range of the data. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots.
For further details see Wikipedia’s entry for boxplot.
- Parameters:
column (str or list of str, optional) – Column name or list of names, or vector. Can be any valid input to
pandas.DataFrame.groupby().by (str or array-like, optional) – Column in the DataFrame to
pandas.DataFrame.groupby(). One box-plot will be done per value of columns in by.ax (object of class matplotlib.axes.Axes, optional) – The matplotlib axes to be used by boxplot.
fontsize (float or str) – Tick label font size in points or as a string (e.g., large).
rot (float, default 0) – The rotation angle of labels (in degrees) with respect to the screen coordinate system.
grid (bool, default True) – Setting this to True will show the grid.
figsize (A tuple (width, height) in inches) – The size of the figure to create in matplotlib.
layout (tuple (rows, columns), optional) – For example, (3, 5) will display the subplots using 3 rows and 5 columns, starting from the top-left.
return_type ({'axes', 'dict', 'both'} or None, default 'axes') –
The kind of object to return. The default is
axes.’axes’ returns the matplotlib axes the boxplot is drawn on.
’dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot.
’both’ returns a namedtuple with the axes and dict.
when grouping with
by, a Series mapping columns toreturn_typeis returned.If
return_typeis None, a NumPy array of axes with the same shape aslayoutis returned.
backend (str, default None) – Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify theplotting.backendfor the whole session, setpd.options.plotting.backend.**kwargs – All other plotting keyword arguments to be passed to
matplotlib.pyplot.boxplot().
- Returns:
See Notes.
- Return type:
result
See also
pandas.Series.plot.histMake a histogram.
matplotlib.pyplot.boxplotMatplotlib equivalent plot.
Notes
The return type depends on the return_type parameter:
‘axes’ : object of class matplotlib.axes.Axes
‘dict’ : dict of matplotlib.lines.Line2D objects
‘both’ : a namedtuple with structure (ax, lines)
For data grouped with
by, return a Series of the above or a numpy array:Seriesarray(forreturn_type = None)
Use
return_type='dict'when you want to tweak the appearance of the lines after plotting. In this case a dict containing the Lines making up the boxes, caps, fliers, medians, and whiskers is returned.Examples
Boxplots can be created for every column in the dataframe by
df.boxplot()or indicating the columns to be used:Boxplots of variables distributions grouped by the values of a third variable can be created using the option
by. For instance:A list of strings (i.e.
['X', 'Y']) can be passed to boxplot in order to group the data by combination of the variables in the x-axis:The layout of boxplot can be adjusted giving a tuple to
layout:Additional formatting can be done to the boxplot, like suppressing the grid (
grid=False), rotating the labels in the x-axis (i.e.rot=45) or changing the fontsize (i.e.fontsize=15):The parameter
return_typecan be used to select the type of element returned by boxplot. Whenreturn_type='axes'is selected, the matplotlib axes on which the boxplot is drawn are returned:>>> boxplot = df.boxplot(column=['Col1', 'Col2'], return_type='axes') >>> type(boxplot) <class 'matplotlib.axes._axes.Axes'>
When grouping with
by, a Series mapping columns toreturn_typeis returned:>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X', ... return_type='axes') >>> type(boxplot) <class 'pandas.core.series.Series'>
If
return_typeis None, a NumPy array of axes with the same shape aslayoutis returned:>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X', ... return_type=None) >>> type(boxplot) <class 'numpy.ndarray'>
- columns
The column labels of the DataFrame.
Examples
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) >>> df A B 0 1 3 1 2 4 >>> df.columns Index(['A', 'B'], dtype='object')
- combine(other: DataFrame, func: Callable[[Series, Series], Series | Hashable], fill_value=None, overwrite: bool = True) DataFrame
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
- Parameters:
other (DataFrame) – The DataFrame to merge column-wise.
func (function) – Function that takes two series as inputs and return a Series or a scalar. Used to merge the two dataframes column by columns.
fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.
overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.
- Returns:
Combination of the provided DataFrames.
- Return type:
See also
DataFrame.combine_firstCombine two DataFrame objects and default to non-null values in frame calling the method.
Examples
Combine using a simple function that chooses the smaller column.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2 >>> df1.combine(df2, take_smaller) A B 0 0 3 1 0 3
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine(df2, np.minimum) A B 0 1 2 1 0 3
Using fill_value fills Nones prior to passing the column to the merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine(df2, take_smaller, fill_value=-5) A B 0 0 -5.0 1 0 4.0
However, if the same element in both dataframes is None, that None is preserved
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]}) >>> df1.combine(df2, take_smaller, fill_value=-5) A B 0 0 -5.0 1 0 3.0
Example that demonstrates the use of overwrite and behavior when the axis differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2]) >>> df1.combine(df2, take_smaller) A B C 0 NaN NaN NaN 1 NaN 3.0 -10.0 2 NaN 3.0 1.0
>>> df1.combine(df2, take_smaller, overwrite=False) A B C 0 0.0 NaN NaN 1 0.0 3.0 -10.0 2 NaN 3.0 1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2]) >>> df2.combine(df1, take_smaller) A B C 0 0.0 NaN NaN 1 0.0 3.0 NaN 2 NaN 3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False) A B C 0 0.0 NaN NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
- combine_first(other: DataFrame) DataFrame
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two. The resulting dataframe contains the ‘first’ dataframe values and overrides the second one values where both first.loc[index, col] and second.loc[index, col] are not missing values, upon calling first.combine_first(second).
- Parameters:
other (DataFrame) – Provided DataFrame to use to fill null values.
- Returns:
The result of combining the provided DataFrame with the other object.
- Return type:
See also
DataFrame.combinePerform series-wise operation on two DataFrames using a given function.
Examples
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]}) >>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) >>> df1.combine_first(df2) A B 0 1.0 3.0 1 0.0 4.0
Null values still persist if the location of that null value does not exist in other
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]}) >>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2]) >>> df1.combine_first(df2) A B C 0 NaN 4.0 NaN 1 0.0 3.0 1.0 2 NaN 3.0 1.0
- compare(other: DataFrame, align_axis: Axis = 1, keep_shape: bool = False, keep_equal: bool = False, result_names: Suffixes = ('self', 'other')) DataFrame
Compare to another DataFrame and show the differences.
- Parameters:
other (DataFrame) – Object to compare with.
align_axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine which axis to align the comparison on.
- 0, or ‘index’Resulting differences are stacked vertically
with rows drawn alternately from self and other.
- 1, or ‘columns’Resulting differences are aligned horizontally
with columns drawn alternately from self and other.
keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
result_names (tuple, default ('self', 'other')) –
Set the dataframes names in the comparison.
Added in version 1.5.0.
- Returns:
DataFrame that shows the differences stacked side by side.
The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
- Return type:
- Raises:
ValueError – When the two DataFrames don’t have identical labels or shape.
See also
Series.compareCompare with another Series and show differences.
DataFrame.equalsTest whether two objects contain the same elements.
Notes
Matching NaNs will not appear as a difference.
Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
Examples
>>> df = pd.DataFrame( ... { ... "col1": ["a", "a", "b", "b", "a"], ... "col2": [1.0, 2.0, 3.0, np.nan, 5.0], ... "col3": [1.0, 2.0, 3.0, 4.0, 5.0] ... }, ... columns=["col1", "col2", "col3"], ... ) >>> df col1 col2 col3 0 a 1.0 1.0 1 a 2.0 2.0 2 b 3.0 3.0 3 b NaN 4.0 4 a 5.0 5.0
>>> df2 = df.copy() >>> df2.loc[0, 'col1'] = 'c' >>> df2.loc[2, 'col3'] = 4.0 >>> df2 col1 col2 col3 0 c 1.0 1.0 1 a 2.0 2.0 2 b 3.0 4.0 3 b NaN 4.0 4 a 5.0 5.0
Align the differences on columns
>>> df.compare(df2) col1 col3 self other self other 0 a c NaN NaN 2 NaN NaN 3.0 4.0
Assign result_names
>>> df.compare(df2, result_names=("left", "right")) col1 col3 left right left right 0 a c NaN NaN 2 NaN NaN 3.0 4.0
Stack the differences on rows
>>> df.compare(df2, align_axis=0) col1 col3 0 self a NaN other c NaN 2 self NaN 3.0 other NaN 4.0
Keep the equal values
>>> df.compare(df2, keep_equal=True) col1 col3 self other self other 0 a c 1.0 1.0 2 b b 3.0 4.0
Keep all original rows and columns
>>> df.compare(df2, keep_shape=True) col1 col2 col3 self other self other self other 0 a c NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN 3.0 4.0 3 NaN NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN NaN
Keep all original rows and columns and also all original values
>>> df.compare(df2, keep_shape=True, keep_equal=True) col1 col2 col3 self other self other self other 0 a c 1.0 1.0 1.0 1.0 1 a a 2.0 2.0 2.0 2.0 2 b b 3.0 3.0 3.0 4.0 3 b b NaN NaN 4.0 4.0 4 a a 5.0 5.0 5.0 5.0
- corr(method: CorrelationMethod = 'pearson', min_periods: int = 1, numeric_only: bool = False) DataFrame
Compute pairwise correlation of columns, excluding NA/null values.
- Parameters:
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays
and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.
numeric_only (bool, default False) –
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of
numeric_onlyis nowFalse.
- Returns:
Correlation matrix.
- Return type:
See also
DataFrame.corrwithCompute pairwise correlation with another DataFrame or Series.
Series.corrCompute the correlation between two Series.
Notes
Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
Examples
>>> def histogram_intersection(a, b): ... v = np.minimum(a, b).sum().round(decimals=1) ... return v >>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)], ... columns=['dogs', 'cats']) >>> df.corr(method=histogram_intersection) dogs cats dogs 1.0 0.3 cats 0.3 1.0
>>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)], ... columns=['dogs', 'cats']) >>> df.corr(min_periods=3) dogs cats dogs 1.0 NaN cats NaN 1.0
- corrwith(other: DataFrame | Series, axis: Axis = 0, drop: bool = False, method: CorrelationMethod = 'pearson', numeric_only: bool = False) Series
Compute pairwise correlation.
Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.
- Parameters:
other (DataFrame, Series) – Object with which to compute correlations.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ to compute row-wise, 1 or ‘columns’ for column-wise.
drop (bool, default False) – Drop missing indices from result.
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
- callable: callable with input two 1d ndarrays
and returning a float.
numeric_only (bool, default False) –
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of
numeric_onlyis nowFalse.
- Returns:
Pairwise correlations.
- Return type:
See also
DataFrame.corrCompute pairwise correlation of columns.
Examples
>>> index = ["a", "b", "c", "d", "e"] >>> columns = ["one", "two", "three", "four"] >>> df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=columns) >>> df2 = pd.DataFrame(np.arange(16).reshape(4, 4), index=index[:4], columns=columns) >>> df1.corrwith(df2) one 1.0 two 1.0 three 1.0 four 1.0 dtype: float64
>>> df2.corrwith(df1, axis=1) a 1.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
- count(axis: Axis = 0, numeric_only: bool = False)
Count non-NA cells for each column or row.
The values None, NaN, NaT,
pandas.NAare considered NA.- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
numeric_only (bool, default False) – Include only float, int or boolean data.
- Returns:
For each column/row the number of non-NA/null entries.
- Return type:
See also
Series.countNumber of non-NA elements in a Series.
DataFrame.value_countsCount unique combinations of columns.
DataFrame.shapeNumber of DataFrame rows and columns (including NA elements).
DataFrame.isnaBoolean same-sized DataFrame showing places of NA elements.
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person": ... ["John", "Myla", "Lewis", "John", "Myla"], ... "Age": [24., np.nan, 21., 33, 26], ... "Single": [False, True, True, True, False]}) >>> df Person Age Single 0 John 24.0 False 1 Myla NaN True 2 Lewis 21.0 True 3 John 33.0 True 4 Myla 26.0 False
Notice the uncounted NA values:
>>> df.count() Person 5 Age 4 Single 5 dtype: int64
Counts for each row:
>>> df.count(axis='columns') 0 3 1 2 2 3 3 3 4 3 dtype: int64
- cov(min_periods: int | None = None, ddof: int | None = 1, numeric_only: bool = False) DataFrame
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as
NaN.This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
- Parameters:
min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result.
ddof (int, default 1) – Delta degrees of freedom. The divisor used in calculations is
N - ddof, whereNrepresents the number of elements. This argument is applicable only when nonanis in the dataframe.numeric_only (bool, default False) –
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of
numeric_onlyis nowFalse.
- Returns:
The covariance matrix of the series of the DataFrame.
- Return type:
See also
Series.covCompute covariance with another Series.
core.window.ewm.ExponentialMovingWindow.covExponential weighted sample covariance.
core.window.expanding.Expanding.covExpanding sample covariance.
core.window.rolling.Rolling.covRolling sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], ... columns=['dogs', 'cats']) >>> df.cov() dogs cats dogs 0.666667 -1.000000 cats -1.000000 1.666667
>>> np.random.seed(42) >>> df = pd.DataFrame(np.random.randn(1000, 5), ... columns=['a', 'b', 'c', 'd', 'e']) >>> df.cov() a b c d e a 0.998438 -0.020161 0.059277 -0.008943 0.014144 b -0.020161 1.059352 -0.008543 -0.024738 0.009826 c 0.059277 -0.008543 1.010670 -0.001486 -0.000271 d -0.008943 -0.024738 -0.001486 0.921297 -0.013692 e 0.014144 0.009826 -0.000271 -0.013692 0.977795
Minimum number of periods
This method also supports an optional
min_periodskeyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:>>> np.random.seed(42) >>> df = pd.DataFrame(np.random.randn(20, 3), ... columns=['a', 'b', 'c']) >>> df.loc[df.index[:5], 'a'] = np.nan >>> df.loc[df.index[5:10], 'b'] = np.nan >>> df.cov(min_periods=12) a b c a 0.316741 NaN -0.150812 b NaN 1.248003 0.191417 c -0.150812 0.191417 0.895202
- cummax(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative maximum of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.maxSimilar functionality but ignores
NaNvalues.DataFrame.maxReturn the maximum over DataFrame axis.
DataFrame.cummaxReturn cumulative maximum over DataFrame axis.
DataFrame.cumminReturn cumulative minimum over DataFrame axis.
DataFrame.cumsumReturn cumulative sum over DataFrame axis.
DataFrame.cumprodReturn cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cummax(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cummax() A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use
axis=1>>> df.cummax(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
- cummin(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative minimum of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.minSimilar functionality but ignores
NaNvalues.DataFrame.minReturn the minimum over DataFrame axis.
DataFrame.cummaxReturn cumulative maximum over DataFrame axis.
DataFrame.cumminReturn cumulative minimum over DataFrame axis.
DataFrame.cumsumReturn cumulative sum over DataFrame axis.
DataFrame.cumprodReturn cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cummin(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cummin() A B 0 2.0 1.0 1 2.0 NaN 2 1.0 0.0
To iterate over columns and find the minimum in each row, use
axis=1>>> df.cummin(axis=1) A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
- cumprod(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative product of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.prodSimilar functionality but ignores
NaNvalues.DataFrame.prodReturn the product over DataFrame axis.
DataFrame.cummaxReturn cumulative maximum over DataFrame axis.
DataFrame.cumminReturn cumulative minimum over DataFrame axis.
DataFrame.cumsumReturn cumulative sum over DataFrame axis.
DataFrame.cumprodReturn cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cumprod(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cumprod() A B 0 2.0 1.0 1 6.0 NaN 2 6.0 0.0
To iterate over columns and find the product in each row, use
axis=1>>> df.cumprod(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
- cumsum(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative sum of Series or DataFrame.
- Return type:
See also
core.window.expanding.Expanding.sumSimilar functionality but ignores
NaNvalues.DataFrame.sumReturn the sum over DataFrame axis.
DataFrame.cummaxReturn cumulative maximum over DataFrame axis.
DataFrame.cumminReturn cumulative minimum over DataFrame axis.
DataFrame.cumsumReturn cumulative sum over DataFrame axis.
DataFrame.cumprodReturn cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cumsum(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cumsum() A B 0 2.0 1.0 1 5.0 NaN 2 6.0 1.0
To iterate over columns and find the sum in each row, use
axis=1>>> df.cumsum(axis=1) A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
- diff(periods: int = 1, axis: Axis = 0) DataFrame
First discrete difference of element.
Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is element in previous row).
- Parameters:
periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Take difference over rows (0) or columns (1).
- Returns:
First differences of the Series.
- Return type:
See also
DataFrame.pct_changePercent change over given number of periods.
DataFrame.shiftShift index by desired number of periods with an optional time freq.
Series.diffFirst discrete difference of object.
Notes
For boolean dtypes, this uses
operator.xor()rather thanoperator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], ... 'b': [1, 1, 2, 3, 5, 8], ... 'c': [1, 4, 9, 16, 25, 36]}) >>> df a b c 0 1 1 1 1 2 1 4 2 3 2 9 3 4 3 16 4 5 5 25 5 6 8 36
>>> df.diff() a b c 0 NaN NaN NaN 1 1.0 0.0 3.0 2 1.0 1.0 5.0 3 1.0 1.0 7.0 4 1.0 2.0 9.0 5 1.0 3.0 11.0
Difference with previous column
>>> df.diff(axis=1) a b c 0 NaN 0 0 1 NaN -1 3 2 NaN -1 7 3 NaN -1 13 4 NaN 0 20 5 NaN 2 28
Difference with 3rd previous row
>>> df.diff(periods=3) a b c 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 3.0 2.0 15.0 4 3.0 4.0 21.0 5 3.0 6.0 27.0
Difference with following row
>>> df.diff(periods=-1) a b c 0 -1.0 0.0 -3.0 1 -1.0 -1.0 -5.0 2 -1.0 -1.0 -7.0 3 -1.0 -2.0 -9.0 4 -1.0 -3.0 -11.0 5 NaN NaN NaN
Overflow in input dtype
>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8) >>> df.diff() a 0 NaN 1 255.0
- div(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to
dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- divide(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to
dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- dot(other: Series) Series
- dot(other: DataFrame | Index | ArrayLike) DataFrame
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.
It can also be called using
self @ other.- Parameters:
other (Series, DataFrame or array-like) – The other object to compute the matrix product with.
- Returns:
If other is a Series, return the matrix product between self and other as a Series. If other is a DataFrame or a numpy.array, return the matrix product of self and other in a DataFrame of a np.array.
- Return type:
See also
Series.dotSimilar method for Series.
Notes
The dimensions of DataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the matrix product here.
Examples
Here we multiply a DataFrame with a Series.
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]]) >>> s = pd.Series([1, 1, 2, 1]) >>> df.dot(s) 0 -4 1 5 dtype: int64
Here we multiply a DataFrame with another DataFrame.
>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]]) >>> df.dot(other) 0 1 0 1 4 1 2 2
Note that the dot method give the same result as @
>>> df @ other 0 1 0 1 4 1 2 2
The dot method works also if other is an np.array.
>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]]) >>> df.dot(arr) 0 1 0 1 4 1 2 2
Note how shuffling of the objects does not change the result.
>>> s2 = s.reindex([1, 0, 2, 3]) >>> df.dot(s2) 0 -4 1 5 dtype: int64
- drop(labels: IndexLabel = None, *, axis: Axis = 0, index: IndexLabel = None, columns: IndexLabel = None, level: Level = None, inplace: Literal[True], errors: IgnoreRaise = 'raise') None
- drop(labels: IndexLabel = None, *, axis: Axis = 0, index: IndexLabel = None, columns: IndexLabel = None, level: Level = None, inplace: Literal[False] = False, errors: IgnoreRaise = 'raise') DataFrame
- drop(labels: IndexLabel = None, *, axis: Axis = 0, index: IndexLabel = None, columns: IndexLabel = None, level: Level = None, inplace: bool = False, errors: IgnoreRaise = 'raise') DataFrame | None
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.
- Parameters:
labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index (single label or list-like) – Alternative to specifying axis (
labels, axis=0is equivalent toindex=labels).columns (single label or list-like) – Alternative to specifying axis (
labels, axis=1is equivalent tocolumns=labels).level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
inplace (bool, default False) – If False, return a copy. Otherwise, do operation in place and return None.
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
- Returns:
Returns DataFrame or None DataFrame with the specified index or column labels removed or None if inplace=True.
- Return type:
DataFrame or None
- Raises:
KeyError – If any of the labels is not found in the selected axis.
See also
DataFrame.locLabel-location based indexer for selection by label.
DataFrame.dropnaReturn DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicatesReturn DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.dropReturn Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4), ... columns=['A', 'B', 'C', 'D']) >>> df A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11
Drop columns
>>> df.drop(['B', 'C'], axis=1) A D 0 0 3 1 4 7 2 8 11
>>> df.drop(columns=['B', 'C']) A D 0 0 3 1 4 7 2 8 11
Drop a row by index
>>> df.drop([0, 1]) A B C D 2 8 9 10 11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['llama', 'cow', 'falcon'], ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> df = pd.DataFrame(index=midx, columns=['big', 'small'], ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20], ... [250, 150], [1.5, 0.8], [320, 250], ... [1, 0.8], [0.3, 0.2]]) >>> df big small llama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 weight 1.0 0.8 length 0.3 0.2
Drop a specific index combination from the MultiIndex DataFrame, i.e., drop the combination
'falcon'and'weight', which deletes only the corresponding row>>> df.drop(index=('falcon', 'weight')) big small llama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 length 0.3 0.2
>>> df.drop(index='cow', columns='small') big llama speed 45.0 weight 200.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3
>>> df.drop(index='length', level=1) big small llama speed 45.0 30.0 weight 200.0 100.0 cow speed 30.0 20.0 weight 250.0 150.0 falcon speed 320.0 250.0 weight 1.0 0.8
- drop_duplicates(subset: Hashable | Sequence[Hashable] | None = None, *, keep: DropKeep = 'first', inplace: Literal[True], ignore_index: bool = False) None
- drop_duplicates(subset: Hashable | Sequence[Hashable] | None = None, *, keep: DropKeep = 'first', inplace: Literal[False] = False, ignore_index: bool = False) DataFrame
- drop_duplicates(subset: Hashable | Sequence[Hashable] | None = None, *, keep: DropKeep = 'first', inplace: bool = False, ignore_index: bool = False) DataFrame | None
Return DataFrame with duplicate rows removed.
Considering certain columns is optional. Indexes, including time indexes are ignored.
- Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
keep ({‘first’, ‘last’,
False}, default ‘first’) –Determines which duplicates (if any) to keep.
’first’ : Drop duplicates except for the first occurrence.
’last’ : Drop duplicates except for the last occurrence.
False: Drop all duplicates.
inplace (bool, default
False) – Whether to modify the DataFrame rather than creating a new one.ignore_index (bool, default
False) – IfTrue, the resulting axis will be labeled 0, 1, …, n - 1.
- Returns:
DataFrame with duplicates removed or None if
inplace=True.- Return type:
DataFrame or None
See also
DataFrame.value_countsCount unique combinations of columns.
Examples
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates() brand style rating 0 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
To remove duplicates on specific column(s), use
subset.>>> df.drop_duplicates(subset=['brand']) brand style rating 0 Yum Yum cup 4.0 2 Indomie cup 3.5
To remove duplicates and keep last occurrences, use
keep.>>> df.drop_duplicates(subset=['brand', 'style'], keep='last') brand style rating 1 Yum Yum cup 4.0 2 Indomie cup 3.5 4 Indomie pack 5.0
- dropna(*, axis: Axis = 0, how: AnyAll | lib.NoDefault = <no_default>, thresh: int | lib.NoDefault = <no_default>, subset: IndexLabel = None, inplace: Literal[False] = False, ignore_index: bool = False) DataFrame
- dropna(*, axis: Axis = 0, how: AnyAll | lib.NoDefault = <no_default>, thresh: int | lib.NoDefault = <no_default>, subset: IndexLabel = None, inplace: Literal[True], ignore_index: bool = False) None
Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing value.
Only a single axis is allowed.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
’any’ : If any NA values are present, drop that row or column.
’all’ : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
ignore_index (bool, default
False) –If
True, the resulting axis will be labeled 0, 1, …, n - 1.Added in version 2.0.0.
- Returns:
DataFrame with NA entries dropped from it or None if
inplace=True.- Return type:
DataFrame or None
See also
DataFrame.isnaIndicate missing values.
DataFrame.notnaIndicate existing (non-missing) values.
DataFrame.fillnaReplace missing values.
Series.dropnaDrop missing values.
Index.dropnaDrop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'], ... "toy": [np.nan, 'Batmobile', 'Bullwhip'], ... "born": [pd.NaT, pd.Timestamp("1940-04-25"), ... pd.NaT]}) >>> df name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna() name toy born 1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns') name 0 Alfred 1 Batman 2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all') name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2) name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy']) name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
- duplicated(subset: Hashable | Sequence[Hashable] | None = None, keep: DropKeep = 'first') Series
Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
- Parameters:
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
keep ({'first', 'last', False}, default 'first') –
Determines which duplicates (if any) to mark.
first: Mark duplicates asTrueexcept for the first occurrence.last: Mark duplicates asTrueexcept for the last occurrence.False : Mark all duplicates as
True.
- Returns:
Boolean series for each duplicated rows.
- Return type:
See also
Index.duplicatedEquivalent method on index.
Series.duplicatedEquivalent method on Series.
Series.drop_duplicatesRemove duplicate values from Series.
DataFrame.drop_duplicatesRemove duplicate values from DataFrame.
Examples
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence is set on False and all others on True.
>>> df.duplicated() 0 False 1 True 2 False 3 False 4 False dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others on True.
>>> df.duplicated(keep='last') 0 True 1 False 2 False 3 False 4 False dtype: bool
By setting
keepon False, all duplicates are True.>>> df.duplicated(keep=False) 0 True 1 True 2 False 3 False 4 False dtype: bool
To find duplicates on specific column(s), use
subset.>>> df.duplicated(subset=['brand']) 0 False 1 True 2 False 3 True 4 True dtype: bool
- eq(other, axis: Axis = 'columns', level=None) DataFrame
Get Equal to of dataframe and other, element-wise (binary operator eq).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eqCompare DataFrames for equality elementwise.
DataFrame.neCompare DataFrames for inequality elementwise.
DataFrame.leCompare DataFrames for less than inequality or equality elementwise.
DataFrame.ltCompare DataFrames for strictly less than inequality elementwise.
DataFrame.geCompare DataFrames for greater than inequality or equality elementwise.
DataFrame.gtCompare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a
Series, the columns of a DataFrame are aligned with the index of other and broadcast:>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- eval(expr: str, *, inplace: Literal[False] = False, **kwargs) Any
- eval(expr: str, *, inplace: Literal[True], **kwargs) None
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
- Parameters:
expr (str) – The expression string to evaluate.
inplace (bool, default False) – If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
**kwargs – See the documentation for
eval()for complete details on the keyword arguments accepted byquery().
- Returns:
The result of the evaluation or None if
inplace=True.- Return type:
ndarray, scalar, pandas object, or None
See also
DataFrame.queryEvaluates a boolean expression to query the columns of a frame.
DataFrame.assignCan evaluate an expression or function to create new values for a column.
evalEvaluate a Python expression as a string using various backends.
Notes
For more details see the API documentation for
eval(). For detailed examples see enhancing performance with eval.Examples
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)}) >>> df A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2 >>> df.eval('A + B') 0 11 1 10 2 9 3 8 4 7 dtype: int64
Assignment is allowed though by default the original DataFrame is not modified.
>>> df.eval('C = A + B') A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7 >>> df A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2
Multiple columns can be assigned to using multi-line expressions:
>>> df.eval( ... ''' ... C = A + B ... D = A - B ... ''' ... ) A B C D 0 1 10 11 -9 1 2 8 10 -6 2 3 6 9 -3 3 4 4 8 0 4 5 2 7 3
- explode(column: IndexLabel, ignore_index: bool = False) DataFrame
Transform each element of a list-like to a row, replicating index values.
- Parameters:
column (IndexLabel) –
Column(s) to explode. For multiple columns, specify a non-empty list with each element be str or tuple, and all specified columns their list-like data on same row of the frame must have matching length.
Added in version 1.3.0: Multi-column explode
ignore_index (bool, default False) – If True, the resulting index will be labeled 0, 1, …, n - 1.
- Returns:
Exploded lists to rows of the subset columns; index will be duplicated for these rows.
- Return type:
- Raises:
ValueError : –
If columns of the frame are not unique. * If specified columns to explode is empty list. * If specified columns to explode have not matching count of elements rowwise in the frame.
See also
DataFrame.unstackPivot a level of the (necessarily hierarchical) index labels.
DataFrame.meltUnpivot a DataFrame from wide format to long format.
Series.explodeExplode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]], ... 'B': 1, ... 'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]}) >>> df A B C 0 [0, 1, 2] 1 [a, b, c] 1 foo 1 NaN 2 [] 1 [] 3 [3, 4] 1 [d, e]
Single-column explode.
>>> df.explode('A') A B C 0 0 1 [a, b, c] 0 1 1 [a, b, c] 0 2 1 [a, b, c] 1 foo 1 NaN 2 NaN 1 [] 3 3 1 [d, e] 3 4 1 [d, e]
Multi-column explode.
>>> df.explode(list('AC')) A B C 0 0 1 a 0 1 1 b 0 2 1 c 1 foo 1 NaN 2 NaN 1 NaN 3 3 1 d 3 4 1 e
- floordiv(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to
dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- classmethod from_dict(data: dict, orient: FromDictOrient = 'columns', dtype: Dtype | None = None, columns: Axes | None = None) DataFrame
Construct DataFrame from dict of array-like or dicts.
Creates DataFrame object from dictionary by columns or by index allowing dtype specification.
- Parameters:
data (dict) – Of the form {field : array-like} or {field : dict}.
orient ({'columns', 'index', 'tight'}, default 'columns') –
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].
Added in version 1.4.0: ‘tight’ as an allowed value for the
orientargumentdtype (dtype, default None) – Data type to force after DataFrame construction, otherwise infer.
columns (list, default None) – Column labels to use when
orient='index'. Raises a ValueError if used withorient='columns'ororient='tight'.
- Return type:
See also
DataFrame.from_recordsDataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.
DataFrameDataFrame object creation using constructor.
DataFrame.to_dictConvert the DataFrame to a dictionary.
Examples
By default the keys of the dict become the DataFrame columns:
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']} >>> pd.DataFrame.from_dict(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
Specify
orient='index'to create the DataFrame using dictionary keys as rows:>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']} >>> pd.DataFrame.from_dict(data, orient='index') 0 1 2 3 row_1 3 2 1 0 row_2 a b c d
When using the ‘index’ orientation, the column names can be specified manually:
>>> pd.DataFrame.from_dict(data, orient='index', ... columns=['A', 'B', 'C', 'D']) A B C D row_1 3 2 1 0 row_2 a b c d
Specify
orient='tight'to create the DataFrame using a ‘tight’ format:>>> data = {'index': [('a', 'b'), ('a', 'c')], ... 'columns': [('x', 1), ('y', 2)], ... 'data': [[1, 3], [2, 4]], ... 'index_names': ['n1', 'n2'], ... 'column_names': ['z1', 'z2']} >>> pd.DataFrame.from_dict(data, orient='tight') z1 x y z2 1 2 n1 n2 a b 1 3 c 2 4
- classmethod from_records(data, index=None, exclude=None, columns=None, coerce_float: bool = False, nrows: int | None = None) DataFrame
Convert structured or record ndarray to DataFrame.
Creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame.
- Parameters:
data (structured ndarray, sequence of tuples or dicts, or DataFrame) –
Structured input data.
Deprecated since version 2.1.0: Passing a DataFrame is deprecated.
index (str, list of fields, array-like) – Field of array to use as the index, alternately a specific set of input labels to use.
exclude (sequence, default None) – Columns or fields to exclude.
columns (sequence, default None) – Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns).
coerce_float (bool, default False) – Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
nrows (int, default None) – Number of rows to read if data is an iterator.
- Return type:
See also
DataFrame.from_dictDataFrame from dict of array-like or dicts.
DataFrameDataFrame object creation using constructor.
Examples
Data can be provided as a structured ndarray:
>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')], ... dtype=[('col_1', 'i4'), ('col_2', 'U1')]) >>> pd.DataFrame.from_records(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
Data can be provided as a list of dicts:
>>> data = [{'col_1': 3, 'col_2': 'a'}, ... {'col_1': 2, 'col_2': 'b'}, ... {'col_1': 1, 'col_2': 'c'}, ... {'col_1': 0, 'col_2': 'd'}] >>> pd.DataFrame.from_records(data) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
Data can be provided as a list of tuples with corresponding columns:
>>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')] >>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2']) col_1 col_2 0 3 a 1 2 b 2 1 c 3 0 d
- ge(other, axis: Axis = 'columns', level=None) DataFrame
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eqCompare DataFrames for equality elementwise.
DataFrame.neCompare DataFrames for inequality elementwise.
DataFrame.leCompare DataFrames for less than inequality or equality elementwise.
DataFrame.ltCompare DataFrames for strictly less than inequality elementwise.
DataFrame.geCompare DataFrames for greater than inequality or equality elementwise.
DataFrame.gtCompare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a
Series, the columns of a DataFrame are aligned with the index of other and broadcast:>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- groupby(by=None, axis: Axis | lib.NoDefault = <no_default>, level: IndexLabel | None = None, as_index: bool = True, sort: bool = True, group_keys: bool = True, observed: bool | lib.NoDefault = <no_default>, dropna: bool = True) DataFrameGroupBy
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
- Parameters:
by (mapping, function, label, pd.Grouper or list of such) – Used to determine the groups for the groupby. If
byis a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see.align()method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns inself. Notice that a tuple is interpreted as a (single) key.axis ({0 or 'index', 1 or 'columns'}, default 0) –
Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.
Deprecated since version 2.1.0: Will be removed and behave like axis=0 in a future version. For
axis=1, doframe.T.groupby(...)instead.level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both
byandlevel.as_index (bool, default True) – Return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output. This argument has no effect on filtrations (see the filtrations in the user guide), such as
head(),tail(),nth()and in transformations (see the transformations in the user guide).sort (bool, default True) –
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group. If False, the groups will appear in the same order as they did in the original DataFrame. This argument has no effect on filtrations (see the filtrations in the user guide), such as
head(),tail(),nth()and in transformations (see the transformations in the user guide).Changed in version 2.0.0: Specifying
sort=Falsewith an ordered categorical grouper will no longer sort the values.group_keys (bool, default True) –
When calling apply and the
byargument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.Changed in version 1.5.0: Warns that
group_keyswill no longer be ignored when the result fromapplyis a like-indexed Series or DataFrame. Specifygroup_keysexplicitly to include the group keys or not.Changed in version 2.0.0:
group_keysnow defaults toTrue.observed (bool, default False) –
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
Deprecated since version 2.1.0: The default value will change to True in a future version of pandas.
dropna (bool, default True) – If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
- Returns:
Returns a groupby object that contains information about the groups.
- Return type:
pandas.api.typing.DataFrameGroupBy
See also
resampleConvenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
Examples
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df Animal Max Speed 0 Falcon 380.0 1 Falcon 370.0 2 Parrot 24.0 3 Parrot 26.0 >>> df.groupby(['Animal']).mean() Max Speed Animal Falcon 375.0 Parrot 25.0
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) >>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]}, ... index=index) >>> df Max Speed Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 >>> df.groupby(level=0).mean() Max Speed Animal Falcon 370.0 Parrot 25.0 >>> df.groupby(level="Type").mean() Max Speed Type Captive 210.0 Wild 185.0
We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum() a c b 1.0 2 3 2.0 2 5
>>> df.groupby(by=["b"], dropna=False).sum() a c b 1.0 2 3 2.0 2 5 NaN 1 4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]] >>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum() b c a a 13.0 13.0 b 12.3 123.0
>>> df.groupby(by="a", dropna=False).sum() b c a a 13.0 13.0 b 12.3 123.0 NaN 12.3 33.0
When using
.apply(), usegroup_keysto include or exclude the group keys. Thegroup_keysargument defaults toTrue(include).>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', ... 'Parrot', 'Parrot'], ... 'Max Speed': [380., 370., 24., 26.]}) >>> df.groupby("Animal", group_keys=True)[['Max Speed']].apply(lambda x: x) Max Speed Animal Falcon 0 380.0 1 370.0 Parrot 2 24.0 3 26.0
>>> df.groupby("Animal", group_keys=False)[['Max Speed']].apply(lambda x: x) Max Speed 0 380.0 1 370.0 2 24.0 3 26.0
- gt(other, axis: Axis = 'columns', level=None) DataFrame
Get Greater than of dataframe and other, element-wise (binary operator gt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eqCompare DataFrames for equality elementwise.
DataFrame.neCompare DataFrames for inequality elementwise.
DataFrame.leCompare DataFrames for less than inequality or equality elementwise.
DataFrame.ltCompare DataFrames for strictly less than inequality elementwise.
DataFrame.geCompare DataFrames for greater than inequality or equality elementwise.
DataFrame.gtCompare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a
Series, the columns of a DataFrame are aligned with the index of other and broadcast:>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- hist(column: IndexLabel | None = None, by=None, grid: bool = True, xlabelsize: int | None = None, xrot: float | None = None, ylabelsize: int | None = None, yrot: float | None = None, ax=None, sharex: bool = False, sharey: bool = False, figsize: tuple[int, int] | None = None, layout: tuple[int, int] | None = None, bins: int | Sequence[int] = 10, backend: str | None = None, legend: bool = False, **kwargs)
Make a histogram of the DataFrame’s columns.
A histogram is a representation of the distribution of data. This function calls
matplotlib.pyplot.hist(), on each series in the DataFrame, resulting in one histogram per column.- Parameters:
data (DataFrame) – The pandas object holding the data.
column (str or sequence, optional) – If passed, will be used to limit data to a subset of columns.
by (object, optional) – If passed, then used to form histograms for separate groups.
grid (bool, default True) – Whether to show axis grid lines.
xlabelsize (int, default None) – If specified changes the x-axis label size.
xrot (float, default None) – Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.
ylabelsize (int, default None) – If specified changes the y-axis label size.
yrot (float, default None) – Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.
ax (Matplotlib axes object, default None) – The axes to plot the histogram on.
sharex (bool, default True if ax is None else False) – In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.
sharey (bool, default False) – In case subplots=True, share y axis and set some y axis labels to invisible.
figsize (tuple, optional) – The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.
layout (tuple, optional) – Tuple of (rows, columns) for the layout of the histograms.
bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.
backend (str, default None) – Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify theplotting.backendfor the whole session, setpd.options.plotting.backend.legend (bool, default False) – Whether to show the legend.
**kwargs – All other plotting keyword arguments to be passed to
matplotlib.pyplot.hist().
- Return type:
matplotlib.AxesSubplot or numpy.ndarray of them
See also
matplotlib.pyplot.histPlot a histogram using matplotlib.
Examples
This example draws a histogram based on the length and width of some animals, displayed in three bins
- idxmax(axis: Axis = 0, skipna: bool = True, numeric_only: bool = False) Series
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
numeric_only (bool, default False) –
Include only float, int or boolean data.
Added in version 1.5.0.
- Returns:
Indexes of maxima along the specified axis.
- Return type:
- Raises:
ValueError –
If the row/column is empty
See also
Series.idxmaxReturn index of the maximum element.
Notes
This method is the DataFrame version of
ndarray.argmax.Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48], ... 'co2_emissions': [37.2, 19.66, 1712]}, ... index=['Pork', 'Wheat Products', 'Beef'])
>>> df consumption co2_emissions Pork 10.51 37.20 Wheat Products 103.11 19.66 Beef 55.48 1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax() consumption Wheat Products co2_emissions Beef dtype: object
To return the index for the maximum value in each row, use
axis="columns".>>> df.idxmax(axis="columns") Pork co2_emissions Wheat Products consumption Beef co2_emissions dtype: object
- idxmin(axis: Axis = 0, skipna: bool = True, numeric_only: bool = False) Series
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
numeric_only (bool, default False) –
Include only float, int or boolean data.
Added in version 1.5.0.
- Returns:
Indexes of minima along the specified axis.
- Return type:
- Raises:
ValueError –
If the row/column is empty
See also
Series.idxminReturn index of the minimum element.
Notes
This method is the DataFrame version of
ndarray.argmin.Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48], ... 'co2_emissions': [37.2, 19.66, 1712]}, ... index=['Pork', 'Wheat Products', 'Beef'])
>>> df consumption co2_emissions Pork 10.51 37.20 Wheat Products 103.11 19.66 Beef 55.48 1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin() consumption Pork co2_emissions Wheat Products dtype: object
To return the index for the minimum value in each row, use
axis="columns".>>> df.idxmin(axis="columns") Pork consumption Wheat Products co2_emissions Beef consumption dtype: object
- index
The index (row labels) of the DataFrame.
The index of a DataFrame is a series of labels that identify each row. The labels can be integers, strings, or any other hashable type. The index is used for label-based access and alignment, and can be accessed or modified using this attribute.
- Returns:
The index labels of the DataFrame.
- Return type:
pandas.Index
See also
DataFrame.columnsThe column labels of the DataFrame.
DataFrame.to_numpyConvert the DataFrame to a NumPy array.
Examples
>>> df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Aritra'], ... 'Age': [25, 30, 35], ... 'Location': ['Seattle', 'New York', 'Kona']}, ... index=([10, 20, 30])) >>> df.index Index([10, 20, 30], dtype='int64')
In this example, we create a DataFrame with 3 rows and 3 columns, including Name, Age, and Location information. We set the index labels to be the integers 10, 20, and 30. We then access the index attribute of the DataFrame, which returns an Index object containing the index labels.
>>> df.index = [100, 200, 300] >>> df Name Age Location 100 Alice 25 Seattle 200 Bob 30 New York 300 Aritra 35 Kona
In this example, we modify the index labels of the DataFrame by assigning a new list of labels to the index attribute. The DataFrame is then updated with the new labels, and the output shows the modified DataFrame.
- info(verbose: bool | None = None, buf: WriteBuffer[str] | None = None, max_cols: int | None = None, memory_usage: bool | str | None = None, show_counts: bool | None = None) None
Print a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
- Parameters:
verbose (bool, optional) – Whether to print the full summary. By default, the setting in
pandas.options.display.max_info_columnsis followed.buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
max_cols (int, optional) – When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in
pandas.options.display.max_info_columnsis used.memory_usage (bool, str, optional) –
Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the
pandas.options.display.memory_usagesetting.True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources. See the Frequently Asked Questions for more details.
show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than
pandas.options.display.max_info_rowsandpandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
- Returns:
This method prints a summary of a DataFrame and returns None.
- Return type:
None
See also
DataFrame.describeGenerate descriptive statistics of DataFrame columns.
DataFrame.memory_usageMemory usage of DataFrame columns.
Examples
>>> int_values = [1, 2, 3, 4, 5] >>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon'] >>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0] >>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values, ... "float_col": float_values}) >>> df int_col text_col float_col 0 1 alpha 0.00 1 2 beta 0.25 2 3 gamma 0.50 3 4 delta 0.75 4 5 epsilon 1.00
Prints information of all columns:
>>> df.info(verbose=True) <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 int_col 5 non-null int64 1 text_col 5 non-null object 2 float_col 5 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 248.0+ bytes
Prints a summary of columns count and its dtypes but not per column information:
>>> df.info(verbose=False) <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Columns: 3 entries, int_col to float_col dtypes: float64(1), int64(1), object(1) memory usage: 248.0+ bytes
Pipe output of DataFrame.info to buffer instead of sys.stdout, get buffer content and writes to a text file:
>>> import io >>> buffer = io.StringIO() >>> df.info(buf=buffer) >>> s = buffer.getvalue() >>> with open("df_info.txt", "w", ... encoding="utf-8") as f: ... f.write(s) 260
The memory_usage parameter allows deep introspection mode, specially useful for big DataFrames and fine-tune memory optimization:
>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6) >>> df = pd.DataFrame({ ... 'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6), ... 'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6), ... 'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6) ... }) >>> df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000000 entries, 0 to 999999 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 column_1 1000000 non-null object 1 column_2 1000000 non-null object 2 column_3 1000000 non-null object dtypes: object(3) memory usage: 22.9+ MB
>>> df.info(memory_usage='deep') <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000000 entries, 0 to 999999 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 column_1 1000000 non-null object 1 column_2 1000000 non-null object 2 column_3 1000000 non-null object dtypes: object(3) memory usage: 165.9 MB
- insert(loc: int, column: Hashable, value: Scalar | AnyArrayLike, allow_duplicates: bool | lib.NoDefault = <no_default>) None
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- Parameters:
loc (int) – Insertion index. Must verify 0 <= loc <= len(columns).
column (str, number, or hashable object) – Label of the inserted column.
value (Scalar, Series, or array-like) – Content of the inserted column.
allow_duplicates (bool, optional, default lib.no_default) – Allow duplicate column labels to be created.
See also
Index.insertInsert new item by index.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df col1 col2 0 1 3 1 2 4 >>> df.insert(1, "newcol", [99, 99]) >>> df col1 newcol col2 0 1 99 3 1 2 99 4 >>> df.insert(0, "col1", [100, 100], allow_duplicates=True) >>> df col1 col1 newcol col2 0 100 1 99 3 1 100 2 99 4
Notice that pandas uses index alignment in case of value from type Series:
>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2])) >>> df col0 col1 col1 newcol col2 0 NaN 100 1 99 3 1 5.0 100 2 99 4
- isetitem(loc, value) None
Set the given value in the column with position loc.
This is a positional analogue to
__setitem__.- Parameters:
loc (int or sequence of ints) – Index position for the column.
value (scalar or arraylike) – Value(s) for the column.
Notes
frame.isetitem(loc, value)is an in-place method as it will modify the DataFrame in place (not returning a new object). In contrast toframe.iloc[:, i] = valuewhich will try to update the existing values in place,frame.isetitem(loc, value)will not update the values of the column itself in place, it will instead insert a new array.In cases where
frame.columnsis unique, this is equivalent toframe[frame.columns[i]] = value.
- isin(values: Series | DataFrame | Sequence | Mapping) DataFrame
Whether each element in the DataFrame is contained in values.
- Parameters:
values (iterable, Series, DataFrame or dict) – The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
- Returns:
DataFrame of booleans showing whether each element in the DataFrame is contained in values.
- Return type:
See also
DataFrame.eqEquality test for DataFrame.
Series.isinEquivalent method on Series.
Series.str.containsTest if pattern or regex is contained within a string of a Series or Index.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]}, ... index=['falcon', 'dog']) >>> df num_legs num_wings falcon 2 2 dog 4 0
When
valuesis a list check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings)>>> df.isin([0, 2]) num_legs num_wings falcon True True dog False True
To check if
valuesis not in the DataFrame, use the~operator:>>> ~df.isin([0, 2]) num_legs num_wings falcon False False dog True False
When
valuesis a dict, we can pass values to check for each column separately:>>> df.isin({'num_wings': [0, 3]}) num_legs num_wings falcon False False dog False True
When
valuesis a Series or DataFrame the index and column must match. Note that ‘falcon’ does not match based on the number of legs in other.>>> other = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]}, ... index=['spider', 'falcon']) >>> df.isin(other) num_legs num_wings falcon False True dog False False
- isna() DataFrame
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True).- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type:
See also
DataFrame.isnullAlias of isna.
DataFrame.notnaBoolean inverse of isna.
DataFrame.dropnaOmit axes labels with missing values.
isnaTop-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- isnull() DataFrame
DataFrame.isnull is an alias for DataFrame.isna.
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True).- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type:
See also
DataFrame.isnullAlias of isna.
DataFrame.notnaBoolean inverse of isna.
DataFrame.dropnaOmit axes labels with missing values.
isnaTop-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- items() Iterable[tuple[Hashable, Series]]
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
- Yields:
label (object) – The column names for the DataFrame being iterated over.
content (Series) – The column entries belonging to each label, as a Series.
See also
DataFrame.iterrowsIterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuplesIterate over DataFrame rows as namedtuples of the values.
Examples
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'], ... 'population': [1864, 22000, 80000]}, ... index=['panda', 'polar', 'koala']) >>> df species population panda bear 1864 polar bear 22000 koala marsupial 80000 >>> for label, content in df.items(): ... print(f'label: {label}') ... print(f'content: {content}', sep='\n') ... label: species content: panda bear polar bear koala marsupial Name: species, dtype: object label: population content: panda 1864 polar 22000 koala 80000 Name: population, dtype: int64
- iterrows() Iterable[tuple[Hashable, Series]]
Iterate over DataFrame rows as (index, Series) pairs.
- Yields:
index (label or tuple of label) – The index of the row. A tuple for a MultiIndex.
data (Series) – The data of the row as a Series.
See also
DataFrame.itertuplesIterate over DataFrame rows as namedtuples of the values.
DataFrame.itemsIterate over (column name, Series) pairs.
Notes
Because
iterrowsreturns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).To preserve dtypes while iterating over the rows, it is better to use
itertuples()which returns namedtuples of the values and which is generally faster thaniterrows.You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Examples
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float']) >>> row = next(df.iterrows())[1] >>> row int 1.0 float 1.5 Name: 0, dtype: float64 >>> print(row['int'].dtype) float64 >>> print(df['int'].dtype) int64
- itertuples(index: bool = True, name: str | None = 'Pandas') Iterable[tuple[Any, ...]]
Iterate over DataFrame rows as namedtuples.
- Parameters:
index (bool, default True) – If True, return the index as the first element of the tuple.
name (str or None, default "Pandas") – The name of the returned namedtuples or None to return regular tuples.
- Returns:
An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.
- Return type:
iterator
See also
DataFrame.iterrowsIterate over DataFrame rows as (index, Series) pairs.
DataFrame.itemsIterate over (column name, Series) pairs.
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore.
Examples
>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]}, ... index=['dog', 'hawk']) >>> df num_legs num_wings dog 4 0 hawk 2 2 >>> for row in df.itertuples(): ... print(row) ... Pandas(Index='dog', num_legs=4, num_wings=0) Pandas(Index='hawk', num_legs=2, num_wings=2)
By setting the index parameter to False we can remove the index as the first element of the tuple:
>>> for row in df.itertuples(index=False): ... print(row) ... Pandas(num_legs=4, num_wings=0) Pandas(num_legs=2, num_wings=2)
With the name parameter set we set a custom name for the yielded namedtuples:
>>> for row in df.itertuples(name='Animal'): ... print(row) ... Animal(Index='dog', num_legs=4, num_wings=0) Animal(Index='hawk', num_legs=2, num_wings=2)
- join(other: DataFrame | Series | Iterable[DataFrame | Series], on: IndexLabel | None = None, how: MergeHow = 'left', lsuffix: str = '', rsuffix: str = '', sort: bool = False, validate: JoinValidate | None = None) DataFrame
Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
- Parameters:
other (DataFrame, Series, or a list containing any combination of them) – Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
on (str, list of str, or array-like, optional) – Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'left') –
How to handle the operation of the two objects.
left: use calling frame’s index (or column if on is specified)
right: use other’s index.
outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling’s one.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
lsuffix (str, default '') – Suffix to use from left frame’s overlapping columns.
rsuffix (str, default '') – Suffix to use from right frame’s overlapping columns.
sort (bool, default False) – Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
validate (str, optional) –
If specified, checks if join is of specified type.
”one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.
”one_to_many” or “1:m”: check if join keys are unique in left dataset.
”many_to_one” or “m:1”: check if join keys are unique in right dataset.
”many_to_many” or “m:m”: allowed, but does not result in checks.
Added in version 1.5.0.
- Returns:
A dataframe containing columns from both the caller and other.
- Return type:
See also
DataFrame.mergeFor column(s)-on-column(s) operations.
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.
Examples
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df key A 0 K0 A0 1 K1 A1 2 K2 A2 3 K3 A3 4 K4 A4 5 K5 A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], ... 'B': ['B0', 'B1', 'B2']})
>>> other key B 0 K0 B0 1 K1 B1 2 K2 B2
Join DataFrames using their indexes.
>>> df.join(other, lsuffix='_caller', rsuffix='_other') key_caller A key_other B 0 K0 A0 K0 B0 1 K1 A1 K1 B1 2 K2 A2 K2 B2 3 K3 A3 NaN NaN 4 K4 A4 NaN NaN 5 K5 A5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.
>>> df.set_index('key').join(other.set_index('key')) A B key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.
>>> df.join(other.set_index('key'), on='key') key A B 0 K0 A0 B0 1 K1 A1 B1 2 K2 A2 B2 3 K3 A3 NaN 4 K4 A4 NaN 5 K5 A5 NaN
Using non-unique key values shows how they are matched.
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'], ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df key A 0 K0 A0 1 K1 A1 2 K1 A2 3 K3 A3 4 K0 A4 5 K1 A5
>>> df.join(other.set_index('key'), on='key', validate='m:1') key A B 0 K0 A0 B0 1 K1 A1 B1 2 K1 A2 B1 3 K3 A3 NaN 4 K0 A4 B0 5 K1 A5 B1
- kurt(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse']) >>> s cat 1 dog 2 dog 2 mouse 3 dtype: int64 >>> s.kurt() 1.5
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]}, ... index=['cat', 'dog', 'dog', 'mouse']) >>> df a b cat 1 3 dog 2 4 dog 2 4 mouse 3 4 >>> df.kurt() a 1.5 b 4.0 dtype: float64
With axis=None
>>> df.kurt(axis=None).round(6) -0.988693
Using axis=1
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]}, ... index=['cat', 'dog']) >>> df.kurt(axis=1) cat -6.0 dog -6.0 dtype: float64
- Return type:
Series or scalar
- kurtosis(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse']) >>> s cat 1 dog 2 dog 2 mouse 3 dtype: int64 >>> s.kurt() 1.5
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]}, ... index=['cat', 'dog', 'dog', 'mouse']) >>> df a b cat 1 3 dog 2 4 dog 2 4 mouse 3 4 >>> df.kurt() a 1.5 b 4.0 dtype: float64
With axis=None
>>> df.kurt(axis=None).round(6) -0.988693
Using axis=1
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]}, ... index=['cat', 'dog']) >>> df.kurt(axis=1) cat -6.0 dog -6.0 dtype: float64
- Return type:
Series or scalar
- le(other, axis: Axis = 'columns', level=None) DataFrame
Get Less than or equal to of dataframe and other, element-wise (binary operator le).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eqCompare DataFrames for equality elementwise.
DataFrame.neCompare DataFrames for inequality elementwise.
DataFrame.leCompare DataFrames for less than inequality or equality elementwise.
DataFrame.ltCompare DataFrames for strictly less than inequality elementwise.
DataFrame.geCompare DataFrames for greater than inequality or equality elementwise.
DataFrame.gtCompare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a
Series, the columns of a DataFrame are aligned with the index of other and broadcast:>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- lt(other, axis: Axis = 'columns', level=None) DataFrame
Get Less than of dataframe and other, element-wise (binary operator lt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eqCompare DataFrames for equality elementwise.
DataFrame.neCompare DataFrames for inequality elementwise.
DataFrame.leCompare DataFrames for less than inequality or equality elementwise.
DataFrame.ltCompare DataFrames for strictly less than inequality elementwise.
DataFrame.geCompare DataFrames for greater than inequality or equality elementwise.
DataFrame.gtCompare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a
Series, the columns of a DataFrame are aligned with the index of other and broadcast:>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- map(func: PythonFuncType, na_action: str | None = None, **kwargs) DataFrame
Apply a function to a Dataframe elementwise.
Added in version 2.1.0: DataFrame.applymap was deprecated and renamed to DataFrame.map.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
- Parameters:
func (callable) – Python function, returns a single value from a single value.
na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to func.
**kwargs – Additional keyword arguments to pass as keywords arguments to func.
- Returns:
Transformed DataFrame.
- Return type:
See also
DataFrame.applyApply a function along input axis of DataFrame.
DataFrame.replaceReplace values given in to_replace with value.
Series.mapApply a function elementwise on a Series.
Examples
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]]) >>> df 0 1 0 1.000 2.120 1 3.356 4.567
>>> df.map(lambda x: len(str(x))) 0 1 0 3 4 1 5 5
Like Series.map, NA values can be ignored:
>>> df_copy = df.copy() >>> df_copy.iloc[0, 0] = pd.NA >>> df_copy.map(lambda x: len(str(x)), na_action='ignore') 0 1 0 NaN 4 1 5.0 5
It is also possible to use map with functions that are not lambda functions:
>>> df.map(round, ndigits=1) 0 1 0 1.0 2.1 1 3.4 4.6
Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.
>>> df.map(lambda x: x**2) 0 1 0 1.000000 4.494400 1 11.262736 20.857489
But it’s better to avoid map in that case.
>>> df ** 2 0 1 0 1.000000 4.494400 1 11.262736 20.857489
- max(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the maximum of the values over the requested axis.
If you want the index of the maximum, use
idxmax. This is the equivalent of thenumpy.ndarraymethodargmax.- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.max() 8
- mean(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the mean of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.mean() 2.0
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra']) >>> df a b tiger 1 2 zebra 2 3 >>> df.mean() a 1.5 b 2.5 dtype: float64
Using axis=1
>>> df.mean(axis=1) tiger 1.5 zebra 2.5 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, ... index=['tiger', 'zebra']) >>> df.mean(numeric_only=True) a 1.5 dtype: float64
- Return type:
Series or scalar
- median(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the median of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.median() 2.0
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra']) >>> df a b tiger 1 2 zebra 2 3 >>> df.median() a 1.5 b 2.5 dtype: float64
Using axis=1
>>> df.median(axis=1) tiger 1.5 zebra 2.5 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, ... index=['tiger', 'zebra']) >>> df.median(numeric_only=True) a 1.5 dtype: float64
- Return type:
Series or scalar
- melt(id_vars=None, value_vars=None, var_name=None, value_name: Hashable = 'value', col_level: Level | None = None, ignore_index: bool = True) DataFrame
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
id_vars (scalar, tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.
value_vars (scalar, tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name (scalar, default None) – Name to use for the ‘variable’ column. If None it uses
frame.columns.nameor ‘variable’.value_name (scalar, default 'value') – Name to use for the ‘value’ column, can’t be an existing column label.
col_level (scalar, optional) – If columns are a MultiIndex then use this level to melt.
ignore_index (bool, default True) – If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.
- Returns:
Unpivoted DataFrame.
- Return type:
See also
meltIdentical method.
pivot_tableCreate a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivotReturn reshaped DataFrame organized by given index / column values.
DataFrame.explodeExplode a DataFrame from list-like columns to long format.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'}, ... 'B': {0: 1, 1: 3, 2: 5}, ... 'C': {0: 2, 1: 4, 2: 6}}) >>> df A B C 0 a 1 2 1 b 3 4 2 c 5 6
>>> df.melt(id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5
>>> df.melt(id_vars=['A'], value_vars=['B', 'C']) A variable value 0 a B 1 1 b B 3 2 c B 5 3 a C 2 4 b C 4 5 c C 6
The names of ‘variable’ and ‘value’ columns can be customized:
>>> df.melt(id_vars=['A'], value_vars=['B'], ... var_name='myVarname', value_name='myValname') A myVarname myValname 0 a B 1 1 b B 3 2 c B 5
Original index values can be kept around:
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False) A variable value 0 a B 1 1 b B 3 2 c B 5 0 a C 2 1 b C 4 2 c C 6
If you have multi-index columns:
>>> df.columns = [list('ABC'), list('DEF')] >>> df A B C D E F 0 a 1 2 1 b 3 4 2 c 5 6
>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B']) A variable value 0 a B 1 1 b B 3 2 c B 5
>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')]) (A, D) variable_0 variable_1 value 0 a B E 1 1 b B E 3 2 c B E 5
- memory_usage(index: bool = True, deep: bool = False) Series
Return the memory usage of each column in bytes.
The memory usage can optionally include the contribution of the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be suppressed by setting
pandas.options.display.memory_usageto False.- Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If
index=True, the memory usage of the index is the first item in the output.deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
- Returns:
A Series whose index is the original column names and whose values is the memory usage of each column in bytes.
- Return type:
See also
numpy.ndarray.nbytesTotal bytes consumed by the elements of an ndarray.
Series.memory_usageBytes consumed by a Series.
CategoricalMemory-efficient array for string values with many repeated values.
DataFrame.infoConcise summary of a DataFrame.
Notes
See the Frequently Asked Questions for more details.
Examples
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool'] >>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t)) ... for t in dtypes]) >>> df = pd.DataFrame(data) >>> df.head() int64 float64 complex128 object bool 0 1 1.0 1.0+0.0j 1 True 1 1 1.0 1.0+0.0j 1 True 2 1 1.0 1.0+0.0j 1 True 3 1 1.0 1.0+0.0j 1 True 4 1 1.0 1.0+0.0j 1 True
>>> df.memory_usage() Index 128 int64 40000 float64 40000 complex128 80000 object 40000 bool 5000 dtype: int64
>>> df.memory_usage(index=False) int64 40000 float64 40000 complex128 80000 object 40000 bool 5000 dtype: int64
The memory footprint of object dtype columns is ignored by default:
>>> df.memory_usage(deep=True) Index 128 int64 40000 float64 40000 complex128 80000 object 180000 bool 5000 dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True) 5244
- merge(right: DataFrame | Series, how: MergeHow = 'inner', on: IndexLabel | AnyArrayLike | None = None, left_on: IndexLabel | AnyArrayLike | None = None, right_on: IndexLabel | AnyArrayLike | None = None, left_index: bool = False, right_index: bool = False, sort: bool = False, suffixes: Suffixes = ('_x', '_y'), copy: bool | None = None, indicator: str | bool = False, validate: MergeValidate | None = None) DataFrame
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
- Parameters:
right (DataFrame or named Series) – Object to merge with.
how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –
Type of merge to be performed.
left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order of the left keys.
on (label or list) – Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on (label or list, or array-like) – Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_on (label or list, or array-like) – Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_index (bool, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.
right_index (bool, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index.
sort (bool, default False) – Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).
suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
copy (bool, default True) –
If False, avoid copy if possible.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = Trueindicator (bool or str, default False) – If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.
validate (str, optional) –
If specified, checks if merge is of specified type.
”one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
”one_to_many” or “1:m”: check if merge keys are unique in left dataset.
”many_to_one” or “m:1”: check if merge keys are unique in right dataset.
”many_to_many” or “m:m”: allowed, but does not result in checks.
- Returns:
A DataFrame of the two merged objects.
- Return type:
See also
merge_orderedMerge with optional filling/interpolation.
merge_asofMerge on nearest keys.
DataFrame.joinSimilar method using indices.
Examples
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [1, 2, 3, 5]}) >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], ... 'value': [5, 6, 7, 8]}) >>> df1 lkey value 0 foo 1 1 bar 2 2 baz 3 3 foo 5 >>> df2 rkey value 0 foo 5 1 bar 6 2 baz 7 3 foo 8
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.
>>> df1.merge(df2, left_on='lkey', right_on='rkey') lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 1 foo 8 2 bar 2 bar 6 3 baz 3 baz 7 4 foo 5 foo 5 5 foo 5 foo 8
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', ... suffixes=('_left', '_right')) lkey value_left rkey value_right 0 foo 1 foo 5 1 foo 1 foo 8 2 bar 2 bar 6 3 baz 3 baz 7 4 foo 5 foo 5 5 foo 5 foo 8
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False)) Traceback (most recent call last): ... ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]}) >>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]}) >>> df1 a b 0 foo 1 1 bar 2 >>> df2 a c 0 foo 3 1 baz 4
>>> df1.merge(df2, how='inner', on='a') a b c 0 foo 1 3
>>> df1.merge(df2, how='left', on='a') a b c 0 foo 1 3.0 1 bar 2 NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']}) >>> df2 = pd.DataFrame({'right': [7, 8]}) >>> df1 left 0 foo 1 bar >>> df2 right 0 7 1 8
>>> df1.merge(df2, how='cross') left right 0 foo 7 1 foo 8 2 bar 7 3 bar 8
- min(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the minimum of the values over the requested axis.
If you want the index of the minimum, use
idxmin. This is the equivalent of thenumpy.ndarraymethodargmin.- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.min() 0
- mod(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to
dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- mode(axis: Axis = 0, numeric_only: bool = False, dropna: bool = True) DataFrame
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
The axis to iterate over while searching for the mode:
0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.
numeric_only (bool, default False) – If True, only apply to numeric columns.
dropna (bool, default True) – Don’t consider counts of NaN/NaT.
- Returns:
The modes of each column or row.
- Return type:
See also
Series.modeReturn the highest frequency value in a Series.
Series.value_countsReturn the counts of values in a Series.
Examples
>>> df = pd.DataFrame([('bird', 2, 2), ... ('mammal', 4, np.nan), ... ('arthropod', 8, 0), ... ('bird', 2, np.nan)], ... index=('falcon', 'horse', 'spider', 'ostrich'), ... columns=('species', 'legs', 'wings')) >>> df species legs wings falcon bird 2 2.0 horse mammal 4 NaN spider arthropod 8 0.0 ostrich bird 2 NaN
By default, missing values are not considered, and the mode of wings are both 0 and 2. Because the resulting DataFrame has two rows, the second row of
speciesandlegscontainsNaN.>>> df.mode() species legs wings 0 bird 2.0 0.0 1 NaN NaN 2.0
Setting
dropna=FalseNaNvalues are considered and they can be the mode (like for wings).>>> df.mode(dropna=False) species legs wings 0 bird 2 NaN
Setting
numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.>>> df.mode(numeric_only=True) legs wings 0 2.0 0.0 1 NaN 2.0
To compute the mode over columns and not rows, use the axis parameter:
>>> df.mode(axis='columns', numeric_only=True) 0 1 falcon 2.0 NaN horse 4.0 NaN spider 0.0 8.0 ostrich 2.0 NaN
- mul(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to
dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- multiply(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to
dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- ne(other, axis: Axis = 'columns', level=None) DataFrame
Get Not equal to of dataframe and other, element-wise (binary operator ne).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters:
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns:
Result of the comparison.
- Return type:
DataFrame of bool
See also
DataFrame.eqCompare DataFrames for equality elementwise.
DataFrame.neCompare DataFrames for inequality elementwise.
DataFrame.leCompare DataFrames for less than inequality or equality elementwise.
DataFrame.ltCompare DataFrames for strictly less than inequality elementwise.
DataFrame.geCompare DataFrames for greater than inequality or equality elementwise.
DataFrame.gtCompare DataFrames for strictly greater than inequality elementwise.
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100], ... 'revenue': [100, 250, 300]}, ... index=['A', 'B', 'C']) >>> df cost revenue A 250 100 B 150 250 C 100 300
Comparison with a scalar, using either the operator or method:
>>> df == 100 cost revenue A False True B False False C True False
>>> df.eq(100) cost revenue A False True B False False C True False
When other is a
Series, the columns of a DataFrame are aligned with the index of other and broadcast:>>> df != pd.Series([100, 250], index=["cost", "revenue"]) cost revenue A True True B True False C False True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index') cost revenue A True False B True True C True True D True True
When comparing to an arbitrary sequence, the number of columns must match the number elements in other:
>>> df == [250, 100] cost revenue A True True B False False C False False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index') cost revenue A True False B False True C True False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]}, ... index=['A', 'B', 'C', 'D']) >>> other revenue A 300 B 250 C 100 D 150
>>> df.gt(other) cost revenue A False False B False False C False True D False False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220], ... 'revenue': [100, 250, 300, 200, 175, 225]}, ... index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], ... ['A', 'B', 'C', 'A', 'B', 'C']]) >>> df_multindex cost revenue Q1 A 250 100 B 150 250 C 100 300 Q2 A 150 200 B 300 175 C 220 225
>>> df.le(df_multindex, level=1) cost revenue Q1 A True True B True True C True True Q2 A False True B True False C True False
- nlargest(n: int, columns: IndexLabel, keep: NsmallestNlargestKeep = 'first') DataFrame
Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to
df.sort_values(columns, ascending=False).head(n), but more performant.- Parameters:
n (int) – Number of rows to return.
columns (label or list of labels) – Column label(s) to order by.
keep ({'first', 'last', 'all'}, default 'first') –
Where there are duplicate values:
first: prioritize the first occurrence(s)last: prioritize the last occurrence(s)all: keep all the ties of the smallest item even if it means selecting more thannitems.
- Returns:
The first n rows ordered by the given columns in descending order.
- Return type:
See also
DataFrame.nsmallestReturn the first n rows ordered by columns in ascending order.
DataFrame.sort_valuesSort DataFrame by the values.
DataFrame.headReturn the first n rows without re-ordering.
Notes
This function cannot be used with all column types. For example, when specifying columns with object or category dtypes,
TypeErroris raised.Examples
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, ... 434000, 434000, 337000, 11300, ... 11300, 11300], ... 'GDP': [1937894, 2583560 , 12011, 4520, 12128, ... 17036, 182, 38, 311], ... 'alpha-2': ["IT", "FR", "MT", "MV", "BN", ... "IS", "NR", "TV", "AI"]}, ... index=["Italy", "France", "Malta", ... "Maldives", "Brunei", "Iceland", ... "Nauru", "Tuvalu", "Anguilla"]) >>> df population GDP alpha-2 Italy 59000000 1937894 IT France 65000000 2583560 FR Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN Iceland 337000 17036 IS Nauru 11300 182 NR Tuvalu 11300 38 TV Anguilla 11300 311 AI
In the following example, we will use
nlargestto select the three rows having the largest values in column “population”.>>> df.nlargest(3, 'population') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT
When using
keep='last', ties are resolved in reverse order:>>> df.nlargest(3, 'population', keep='last') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Brunei 434000 12128 BN
When using
keep='all', the number of element kept can go beyondnif there are duplicate values for the smallest element, all the ties are kept:>>> df.nlargest(3, 'population', keep='all') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN
However,
nlargestdoes not keepndistinct largest elements:>>> df.nlargest(5, 'population', keep='all') population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN
To order by the largest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.
>>> df.nlargest(3, ['population', 'GDP']) population GDP alpha-2 France 65000000 2583560 FR Italy 59000000 1937894 IT Brunei 434000 12128 BN
- notna() DataFrame
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True). NA values, such as None ornumpy.NaN, get mapped to False values.- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type:
See also
DataFrame.notnullAlias of notna.
DataFrame.isnaBoolean inverse of notna.
DataFrame.dropnaOmit axes labels with missing values.
notnaTop-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- notnull() DataFrame
DataFrame.notnull is an alias for DataFrame.notna.
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True). NA values, such as None ornumpy.NaN, get mapped to False values.- Returns:
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type:
See also
DataFrame.notnullAlias of notna.
DataFrame.isnaBoolean inverse of notna.
DataFrame.dropnaOmit axes labels with missing values.
notnaTop-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- nsmallest(n: int, columns: IndexLabel, keep: NsmallestNlargestKeep = 'first') DataFrame
Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to
df.sort_values(columns, ascending=True).head(n), but more performant.- Parameters:
n (int) – Number of items to retrieve.
columns (list or str) – Column name or names to order by.
keep ({'first', 'last', 'all'}, default 'first') –
Where there are duplicate values:
first: take the first occurrence.last: take the last occurrence.all: keep all the ties of the largest item even if it means selecting more thannitems.
- Return type:
See also
DataFrame.nlargestReturn the first n rows ordered by columns in descending order.
DataFrame.sort_valuesSort DataFrame by the values.
DataFrame.headReturn the first n rows without re-ordering.
Examples
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000, ... 434000, 434000, 337000, 337000, ... 11300, 11300], ... 'GDP': [1937894, 2583560 , 12011, 4520, 12128, ... 17036, 182, 38, 311], ... 'alpha-2': ["IT", "FR", "MT", "MV", "BN", ... "IS", "NR", "TV", "AI"]}, ... index=["Italy", "France", "Malta", ... "Maldives", "Brunei", "Iceland", ... "Nauru", "Tuvalu", "Anguilla"]) >>> df population GDP alpha-2 Italy 59000000 1937894 IT France 65000000 2583560 FR Malta 434000 12011 MT Maldives 434000 4520 MV Brunei 434000 12128 BN Iceland 337000 17036 IS Nauru 337000 182 NR Tuvalu 11300 38 TV Anguilla 11300 311 AI
In the following example, we will use
nsmallestto select the three rows having the smallest values in column “population”.>>> df.nsmallest(3, 'population') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS
When using
keep='last', ties are resolved in reverse order:>>> df.nsmallest(3, 'population', keep='last') population GDP alpha-2 Anguilla 11300 311 AI Tuvalu 11300 38 TV Nauru 337000 182 NR
When using
keep='all', the number of element kept can go beyondnif there are duplicate values for the largest element, all the ties are kept.>>> df.nsmallest(3, 'population', keep='all') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS Nauru 337000 182 NR
However,
nsmallestdoes not keepndistinct smallest elements:>>> df.nsmallest(4, 'population', keep='all') population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Iceland 337000 17036 IS Nauru 337000 182 NR
To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.
>>> df.nsmallest(3, ['population', 'GDP']) population GDP alpha-2 Tuvalu 11300 38 TV Anguilla 11300 311 AI Nauru 337000 182 NR
- nunique(axis: Axis = 0, dropna: bool = True) Series
Count number of distinct elements in specified axis.
Return Series with number of distinct elements. Can ignore NaN values.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
dropna (bool, default True) – Don’t include NaN in the counts.
- Return type:
See also
Series.nuniqueMethod nunique for Series.
DataFrame.countCount non-NA cells for each column or row.
Examples
>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]}) >>> df.nunique() A 3 B 2 dtype: int64
>>> df.nunique(axis=1) 0 1 1 2 2 2 dtype: int64
- pivot(*, columns, index=<no_default>, values=<no_default>) DataFrame
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns. See the User Guide for more on reshaping.
- Parameters:
columns (str or object or a list of str) – Column to use to make new frame’s columns.
index (str or object or a list of str, optional) – Column to use to make new frame’s index. If not given, uses existing index.
values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
- Returns:
Returns reshaped DataFrame.
- Return type:
- Raises:
ValueError: – When there are any index, columns combinations with multiple values. DataFrame.pivot_table when you need to aggregate.
See also
DataFrame.pivot_tableGeneralization of pivot that can handle duplicate values for one index/column pair.
DataFrame.unstackPivot based on the index values instead of a column.
wide_to_longWide panel to long format. Less flexible but more user-friendly than melt.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', ... 'two'], ... 'bar': ['A', 'B', 'C', 'A', 'B', 'C'], ... 'baz': [1, 2, 3, 4, 5, 6], ... 'zoo': ['x', 'y', 'z', 'q', 'w', 't']}) >>> df foo bar baz zoo 0 one A 1 x 1 one B 2 y 2 one C 3 z 3 two A 4 q 4 two B 5 w 5 two C 6 t
>>> df.pivot(index='foo', columns='bar', values='baz') bar A B C foo one 1 2 3 two 4 5 6
>>> df.pivot(index='foo', columns='bar')['baz'] bar A B C foo one 1 2 3 two 4 5 6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo']) baz zoo bar A B C A B C foo one 1 2 3 x y z two 4 5 6 q w t
You could also assign a list of column names or a list of index names.
>>> df = pd.DataFrame({ ... "lev1": [1, 1, 1, 2, 2, 2], ... "lev2": [1, 1, 2, 1, 1, 2], ... "lev3": [1, 2, 1, 2, 1, 2], ... "lev4": [1, 2, 3, 4, 5, 6], ... "values": [0, 1, 2, 3, 4, 5]}) >>> df lev1 lev2 lev3 lev4 values 0 1 1 1 1 0 1 1 1 2 2 1 2 1 2 1 3 2 3 2 1 2 4 3 4 2 1 1 5 4 5 2 2 2 6 5
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values") lev2 1 2 lev3 1 2 1 2 lev1 1 0.0 1.0 2.0 NaN 2 4.0 3.0 NaN 5.0
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values") lev3 1 2 lev1 lev2 1 1 0.0 1.0 2 2.0 NaN 2 1 4.0 3.0 2 NaN 5.0
A ValueError is raised if there are any duplicates.
>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'], ... "bar": ['A', 'A', 'B', 'C'], ... "baz": [1, 2, 3, 4]}) >>> df foo bar baz 0 one A 1 1 one A 2 2 two B 3 3 two C 4
Notice that the first two rows are the same for our index and columns arguments.
>>> df.pivot(index='foo', columns='bar', values='baz') Traceback (most recent call last): ... ValueError: Index contains duplicate entries, cannot reshape
- pivot_table(values=None, index=None, columns=None, aggfunc: AggFuncType = 'mean', fill_value=None, margins: bool = False, dropna: bool = True, margins_name: Level = 'All', observed: bool | lib.NoDefault = <no_default>, sort: bool = True) DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
- Parameters:
values (list-like or scalar, optional) – Column or columns to aggregate.
index (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table index. If a list is passed, it can contain any of the other types (except list). If an array is passed, it must be the same length as the data and will be used in the same manner as column values.
columns (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table column. If a list is passed, it can contain any of the other types (except list). If an array is passed, it must be the same length as the data and will be used in the same manner as column values.
aggfunc (function, list of functions, dict, default "mean") – If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is column to aggregate and the value is function or list of functions. If
margin=True, aggfunc will be used to calculate the partial aggregates.fill_value (scalar, default None) – Value to replace missing values with (in the resulting pivot table, after aggregation).
margins (bool, default False) – If
margins=True, specialAllcolumns and rows will be added with partial group aggregates across the categories on the rows and columns.dropna (bool, default True) – Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.
margins_name (str, default 'All') – Name of the row / column that will contain the totals when margins is True.
observed (bool, default False) –
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
Deprecated since version 2.2.0: The default value of
Falseis deprecated and will change toTruein a future version of pandas.sort (bool, default True) –
Specifies if the result should be sorted.
Added in version 1.3.0.
- Returns:
An Excel style pivot table.
- Return type:
See also
DataFrame.pivotPivot without aggregation that can handle non-numeric data.
DataFrame.meltUnpivot a DataFrame from wide to long format, optionally leaving identifiers set.
wide_to_longWide panel to long format. Less flexible but more user-friendly than melt.
Notes
Reference the user guide for more examples.
Examples
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", ... "bar", "bar", "bar", "bar"], ... "B": ["one", "one", "one", "two", "two", ... "one", "one", "two", "two"], ... "C": ["small", "large", "large", "small", ... "small", "large", "small", "small", ... "large"], ... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7], ... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]}) >>> df A B C D E 0 foo one small 1 2 1 foo one large 2 4 2 foo one large 2 5 3 foo two small 3 5 4 foo two small 3 6 5 bar one large 4 6 6 bar one small 5 8 7 bar two small 6 9 8 bar two large 7 9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc="sum") >>> table C large small A B bar one 4.0 5.0 two 7.0 6.0 foo one 4.0 1.0 two NaN 6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc="sum", fill_value=0) >>> table C large small A B bar one 4 5 two 7 6 foo one 4 1 two 0 6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': "mean", 'E': "mean"}) >>> table D E A C bar large 5.500000 7.500000 small 5.500000 8.500000 foo large 2.000000 4.500000 small 2.333333 4.333333
We can also calculate multiple types of aggregations for any given value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': "mean", ... 'E': ["min", "max", "mean"]}) >>> table D E mean max mean min A C bar large 5.500000 9 7.500000 6 small 5.500000 9 8.500000 8 foo large 2.000000 5 4.500000 4 small 2.333333 6 4.333333 2
- plot
alias of
PlotAccessor
- pop(item: Hashable) Series
Return item and drop from frame. Raise KeyError if not found.
- Parameters:
item (label) – Label of column to be popped.
- Return type:
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0), ... ('parrot', 'bird', 24.0), ... ('lion', 'mammal', 80.5), ... ('monkey', 'mammal', np.nan)], ... columns=('name', 'class', 'max_speed')) >>> df name class max_speed 0 falcon bird 389.0 1 parrot bird 24.0 2 lion mammal 80.5 3 monkey mammal NaN
>>> df.pop('class') 0 bird 1 bird 2 mammal 3 mammal Name: class, dtype: object
>>> df name max_speed 0 falcon 389.0 1 parrot 24.0 2 lion 80.5 3 monkey NaN
- pow(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to
dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- prod(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, min_count: int = 0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_countnon-NA values are present the result will be NA.**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is
1>>> pd.Series([], dtype="float64").prod() 1.0
This can be controlled with the
min_countparameter>>> pd.Series([], dtype="float64").prod(min_count=1) nan
Thanks to the
skipnaparameter,min_counthandles all-NA and empty series identically.>>> pd.Series([np.nan]).prod() 1.0
>>> pd.Series([np.nan]).prod(min_count=1) nan
- product(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, min_count: int = 0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_countnon-NA values are present the result will be NA.**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is
1>>> pd.Series([], dtype="float64").prod() 1.0
This can be controlled with the
min_countparameter>>> pd.Series([], dtype="float64").prod(min_count=1) nan
Thanks to the
skipnaparameter,min_counthandles all-NA and empty series identically.>>> pd.Series([np.nan]).prod() 1.0
>>> pd.Series([np.nan]).prod(min_count=1) nan
- quantile(q: float = 0.5, axis: Axis = 0, numeric_only: bool = False, interpolation: QuantileInterpolation = 'linear', method: Literal['single', 'table'] = 'single') Series
- quantile(q: AnyArrayLike | Sequence[float], axis: Axis = 0, numeric_only: bool = False, interpolation: QuantileInterpolation = 'linear', method: Literal['single', 'table'] = 'single') Series | DataFrame
- quantile(q: float | AnyArrayLike | Sequence[float] = 0.5, axis: Axis = 0, numeric_only: bool = False, interpolation: QuantileInterpolation = 'linear', method: Literal['single', 'table'] = 'single') Series | DataFrame
Return values at the given quantile over requested axis.
- Parameters:
q (float or array-like, default 0.5 (50% quantile)) – Value between 0 <= q <= 1, the quantile(s) to compute.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
numeric_only (bool, default False) –
Include only float, int or boolean data.
Changed in version 2.0.0: The default value of
numeric_onlyis nowFalse.interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
method ({'single', 'table'}, default 'single') – Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.
- Returns:
- If
qis an array, a DataFrame will be returned where the index is
q, the columns are the columns of self, and the values are the quantiles.- If
qis a float, a Series will be returned where the index is the columns of self and the values are the quantiles.
- If
- Return type:
See also
core.window.rolling.Rolling.quantileRolling quantile.
numpy.percentileNumpy function to compute the percentile.
Examples
>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), ... columns=['a', 'b']) >>> df.quantile(.1) a 1.3 b 3.7 Name: 0.1, dtype: float64 >>> df.quantile([.1, .5]) a b 0.1 1.3 3.7 0.5 2.5 55.0
Specifying method=’table’ will compute the quantile over all columns.
>>> df.quantile(.1, method="table", interpolation="nearest") a 1 b 1 Name: 0.1, dtype: int64 >>> df.quantile([.1, .5], method="table", interpolation="nearest") a b 0.1 1 1 0.5 3 100
Specifying numeric_only=False will also compute the quantile of datetime and timedelta data.
>>> df = pd.DataFrame({'A': [1, 2], ... 'B': [pd.Timestamp('2010'), ... pd.Timestamp('2011')], ... 'C': [pd.Timedelta('1 days'), ... pd.Timedelta('2 days')]}) >>> df.quantile(0.5, numeric_only=False) A 1.5 B 2010-07-02 12:00:00 C 1 days 12:00:00 Name: 0.5, dtype: object
- query(expr: str, *, inplace: Literal[False] = False, **kwargs) DataFrame
- query(expr: str, *, inplace: Literal[True], **kwargs) None
- query(expr: str, *, inplace: bool = False, **kwargs) DataFrame | None
Query the columns of a DataFrame with a boolean expression.
- Parameters:
expr (str) –
The query string to evaluate.
You can refer to variables in the environment by prefixing them with an ‘@’ character like
@a + b.You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuations (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as
`Area (cm^2)`). Column names which are Python keywords (like “list”, “for”, “import”, etc) cannot be used.For example, if one of your columns is called
a aand you want to sum it withb, your query should be`a a` + b.inplace (bool) – Whether to modify the DataFrame rather than creating a new one.
**kwargs – See the documentation for
eval()for complete details on the keyword arguments accepted byDataFrame.query().
- Returns:
DataFrame resulting from the provided query expression or None if
inplace=True.- Return type:
DataFrame or None
See also
evalEvaluate a string describing operations on DataFrame columns.
DataFrame.evalEvaluate a string describing operations on DataFrame columns.
Notes
The result of the evaluation of this expression is first passed to
DataFrame.locand if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed toDataFrame.__getitem__().This method uses the top-level
eval()function to evaluate the passed query.The
query()method uses a slightly modified Python syntax by default. For example, the&and|(bitwise) operators have the precedence of their boolean cousins,andandor. This is syntactically valid Python, however the semantics are different.You can change the semantics of the expression by passing the keyword argument
parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can passengine='python'to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to usingnumexpras the engine.The
DataFrame.indexandDataFrame.columnsattributes of theDataFrameinstance are placed in the query namespace by default, which allows you to treat both the index and columns of the frame as a column in the frame. The identifierindexis used for the frame index; you can also use the name of the index to identify it in a query. Please note that Python keywords may not be used as identifiers.For further details and examples see the
querydocumentation in indexing.Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and are converted internally to a Python valid identifier. This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick quoted string are replaced by strings that are allowed as a Python identifier. These characters include all operators in Python, the space character, the question mark, the exclamation mark, the dollar sign, and the euro sign. For other characters that fall outside the ASCII range (U+0001..U+007F) and those that are not further specified in PEP 3131, the query parser will raise an error. This excludes whitespace different than the space character, but also the hashtag (as it is used for comments) and the backtick itself (backtick can also not be escaped).
In a special case, quotes that make a pair around a backtick can confuse the parser. For example,
`it's` > `that's`will raise an error, as it forms a quoted string ('s > `that') with a backtick inside.See also the Python documentation about lexical analysis (https://docs.python.org/3/reference/lexical_analysis.html) in combination with the source code in
pandas.core.computation.parsing.Examples
>>> df = pd.DataFrame({'A': range(1, 6), ... 'B': range(10, 0, -2), ... 'C C': range(10, 5, -1)}) >>> df A B C C 0 1 10 10 1 2 8 9 2 3 6 8 3 4 4 7 4 5 2 6 >>> df.query('A > B') A B C C 4 5 2 6
The previous expression is equivalent to
>>> df[df.A > df.B] A B C C 4 5 2 6
For columns with spaces in their name, you can use backtick quoting.
>>> df.query('B == `C C`') A B C C 0 1 10 10
The previous expression is equivalent to
>>> df[df.B == df['C C']] A B C C 0 1 10 10
- radd(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to
other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, add.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- rdiv(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to
other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- reindex(labels=None, *, index=None, columns=None, axis: Axis | None = None, method: ReindexMethod | None = None, copy: bool | None = None, level: Level | None = None, fill_value: Scalar | None = nan, limit: int | None = None, tolerance=None) DataFrame
Conform DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and
copy=False.- Parameters:
labels (array-like, optional) – New labels / index to conform the axis specified by ‘axis’ to.
index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.
columns (array-like, optional) – New labels for the columns. Preferably an Index object to avoid duplicating data.
axis (int or str, optional) – Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy (bool, default True) –
Return a new object, even if the passed indexes are the same.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = Truelevel (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (scalar, default np.nan) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.
tolerance (optional) –
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation
abs(index[indexer] - target) <= tolerance.Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
- Return type:
DataFrame with changed index.
See also
DataFrame.set_indexSet row labels.
DataFrame.reset_indexRemove row labels or move them to new columns.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
DataFrame.reindexsupports two calling conventions(index=index_labels, columns=column_labels, ...)(labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror'] >>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301], ... 'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, ... index=index) >>> df http_status response_time Firefox 200 0.04 Chrome 200 0.02 Safari 404 0.07 IE10 404 0.08 Konqueror 301 1.00
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned
NaN.>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', ... 'Chrome'] >>> df.reindex(new_index) http_status response_time Safari 404.0 0.07 Iceweasel NaN NaN Comodo Dragon NaN NaN IE10 404.0 0.08 Chrome 200.0 0.02
We can fill in the missing values by passing a value to the keyword
fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keywordmethodto fill theNaNvalues.>>> df.reindex(new_index, fill_value=0) http_status response_time Safari 404 0.07 Iceweasel 0 0.00 Comodo Dragon 0 0.00 IE10 404 0.08 Chrome 200 0.02
>>> df.reindex(new_index, fill_value='missing') http_status response_time Safari 404 0.07 Iceweasel missing missing Comodo Dragon missing missing IE10 404 0.08 Chrome 200 0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent']) http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
Or we can use “axis-style” keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns") http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
To further illustrate the filling functionality in
reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D') >>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, ... index=date_index) >>> df2 prices 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0
Suppose we decide to expand the dataframe to cover a wider date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D') >>> df2.reindex(date_index2) prices 2009-12-29 NaN 2009-12-30 NaN 2009-12-31 NaN 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with
NaN. If desired, we can fill in the missing values using one of several options.For example, to back-propagate the last valid value to fill the
NaNvalues, passbfillas an argument to themethodkeyword.>>> df2.reindex(date_index2, method='bfill') prices 2009-12-29 100.0 2009-12-30 100.0 2009-12-31 100.0 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
Please note that the
NaNvalue present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in theNaNvalues present in the original dataframe, use thefillna()method.See the user guide for more.
- rename(mapper: Renamer | None = None, *, index: Renamer | None = None, columns: Renamer | None = None, axis: Axis | None = None, copy: bool | None = None, inplace: Literal[True], level: Level = None, errors: IgnoreRaise = 'ignore') None
- rename(mapper: Renamer | None = None, *, index: Renamer | None = None, columns: Renamer | None = None, axis: Axis | None = None, copy: bool | None = None, inplace: Literal[False] = False, level: Level = None, errors: IgnoreRaise = 'ignore') DataFrame
- rename(mapper: Renamer | None = None, *, index: Renamer | None = None, columns: Renamer | None = None, axis: Axis | None = None, copy: bool | None = None, inplace: bool = False, level: Level = None, errors: IgnoreRaise = 'ignore') DataFrame | None
Rename columns or index labels.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
See the user guide for more.
- Parameters:
mapper (dict-like or function) – Dict-like or function transformations to apply to that axis’ values. Use either
mapperandaxisto specify the axis to target withmapper, orindexandcolumns.index (dict-like or function) – Alternative to specifying axis (
mapper, axis=0is equivalent toindex=mapper).columns (dict-like or function) – Alternative to specifying axis (
mapper, axis=1is equivalent tocolumns=mapper).axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to target with
mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.copy (bool, default True) –
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = Trueinplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.
level (int or level name, default None) – In case of a MultiIndex, only rename labels in the specified level.
errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
- Returns:
DataFrame with the renamed axis labels or None if
inplace=True.- Return type:
DataFrame or None
- Raises:
KeyError – If any of the labels is not found in the selected axis and “errors=’raise’”.
See also
DataFrame.rename_axisSet the name of the axis.
Examples
DataFrame.renamesupports two calling conventions(index=index_mapper, columns=columns_mapper, ...)(mapper, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Rename columns using a mapping:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) >>> df.rename(columns={"A": "a", "B": "c"}) a c 0 1 4 1 2 5 2 3 6
Rename index using a mapping:
>>> df.rename(index={0: "x", 1: "y", 2: "z"}) A B x 1 4 y 2 5 z 3 6
Cast index labels to a different type:
>>> df.index RangeIndex(start=0, stop=3, step=1) >>> df.rename(index=str).index Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise") Traceback (most recent call last): KeyError: ['C'] not found in axis
Using axis-style parameters:
>>> df.rename(str.lower, axis='columns') a b 0 1 4 1 2 5 2 3 6
>>> df.rename({1: 2, 2: 4}, axis='index') A B 0 1 4 2 2 5 4 3 6
- reorder_levels(order: Sequence[int | str], axis: Axis = 0) DataFrame
Rearrange index levels using input order. May not drop or duplicate levels.
- Parameters:
order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
- Return type:
Examples
>>> data = { ... "class": ["Mammals", "Mammals", "Reptiles"], ... "diet": ["Omnivore", "Carnivore", "Carnivore"], ... "species": ["Humans", "Dogs", "Snakes"], ... } >>> df = pd.DataFrame(data, columns=["class", "diet", "species"]) >>> df = df.set_index(["class", "diet"]) >>> df species class diet Mammals Omnivore Humans Carnivore Dogs Reptiles Carnivore Snakes
Let’s reorder the levels of the index:
>>> df.reorder_levels(["diet", "class"]) species diet class Omnivore Mammals Humans Carnivore Mammals Dogs Reptiles Snakes
- reset_index(level: IndexLabel = None, *, drop: bool = False, inplace: Literal[False] = False, col_level: Hashable = 0, col_fill: Hashable = '', allow_duplicates: bool | lib.NoDefault = <no_default>, names: Hashable | Sequence[Hashable] | None = None) DataFrame
- reset_index(level: IndexLabel = None, *, drop: bool = False, inplace: Literal[True], col_level: Hashable = 0, col_fill: Hashable = '', allow_duplicates: bool | lib.NoDefault = <no_default>, names: Hashable | Sequence[Hashable] | None = None) None
- reset_index(level: IndexLabel = None, *, drop: bool = False, inplace: bool = False, col_level: Hashable = 0, col_fill: Hashable = '', allow_duplicates: bool | lib.NoDefault = <no_default>, names: Hashable | Sequence[Hashable] | None = None) DataFrame | None
Reset the index, or a level of it.
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
- Parameters:
level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.
drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
allow_duplicates (bool, optional, default lib.no_default) –
Allow duplicate column labels to be created.
Added in version 1.5.0.
names (int, str or 1-dimensional list, default None) –
Using the given string, rename the DataFrame column which contains the index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.
Added in version 1.5.0.
- Returns:
DataFrame with the new index or None if
inplace=True.- Return type:
DataFrame or None
See also
DataFrame.set_indexOpposite of reset_index.
DataFrame.reindexChange to new indices or expand indices.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
>>> df = pd.DataFrame([('bird', 389.0), ... ('bird', 24.0), ... ('mammal', 80.5), ... ('mammal', np.nan)], ... index=['falcon', 'parrot', 'lion', 'monkey'], ... columns=('class', 'max_speed')) >>> df class max_speed falcon bird 389.0 parrot bird 24.0 lion mammal 80.5 monkey mammal NaN
When we reset the index, the old index is added as a column, and a new sequential index is used:
>>> df.reset_index() index class max_speed 0 falcon bird 389.0 1 parrot bird 24.0 2 lion mammal 80.5 3 monkey mammal NaN
We can use the drop parameter to avoid the old index being added as a column:
>>> df.reset_index(drop=True) class max_speed 0 bird 389.0 1 bird 24.0 2 mammal 80.5 3 mammal NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'), ... ('bird', 'parrot'), ... ('mammal', 'lion'), ... ('mammal', 'monkey')], ... names=['class', 'name']) >>> columns = pd.MultiIndex.from_tuples([('speed', 'max'), ... ('species', 'type')]) >>> df = pd.DataFrame([(389.0, 'fly'), ... (24.0, 'fly'), ... (80.5, 'run'), ... (np.nan, 'jump')], ... index=index, ... columns=columns) >>> df speed species max type class name bird falcon 389.0 fly parrot 24.0 fly mammal lion 80.5 run monkey NaN jump
Using the names parameter, choose a name for the index column:
>>> df.reset_index(names=['classes', 'names']) classes names speed species max type 0 bird falcon 389.0 fly 1 bird parrot 24.0 fly 2 mammal lion 80.5 run 3 mammal monkey NaN jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class') class speed species max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1) speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
When the index is inserted under another level, we can specify under which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species') species speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus') genus speed species class max type name falcon bird 389.0 fly parrot bird 24.0 fly lion mammal 80.5 run monkey mammal NaN jump
- rfloordiv(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to
other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, floordiv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- rmod(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to
other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mod.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- rmul(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to
other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mul.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- round(decimals: int | dict[IndexLabel, int] | Series = 0, *args, **kwargs) DataFrame
Round a DataFrame to a variable number of decimal places.
- Parameters:
decimals (int, dict, Series) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
*args – Additional keywords have no effect but might be accepted for compatibility with numpy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.
- Returns:
A DataFrame with the affected columns rounded to the specified number of decimal places.
- Return type:
See also
numpy.aroundRound a numpy array to the given number of decimals.
Series.roundRound a Series to the given number of decimals.
Examples
>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)], ... columns=['dogs', 'cats']) >>> df dogs cats 0 0.21 0.32 1 0.01 0.67 2 0.66 0.03 3 0.21 0.18
By providing an integer each column is rounded to the same number of decimal places
>>> df.round(1) dogs cats 0 0.2 0.3 1 0.0 0.7 2 0.7 0.0 3 0.2 0.2
With a dict, the number of places for specific columns can be specified with the column names as key and the number of decimal places as value
>>> df.round({'dogs': 1, 'cats': 0}) dogs cats 0 0.2 0.0 1 0.0 1.0 2 0.7 0.0 3 0.2 0.0
Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as value
>>> decimals = pd.Series([0, 1], index=['cats', 'dogs']) >>> df.round(decimals) dogs cats 0 0.2 0.0 1 0.0 1.0 2 0.7 0.0 3 0.2 0.0
- rpow(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to
other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, pow.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- rsub(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to
other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, sub.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- rtruediv(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to
other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- select_dtypes(include=None, exclude=None) Self
Return a subset of the DataFrame’s columns based on the column dtypes.
- Parameters:
include (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
exclude (scalar or list-like) – A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
- Returns:
The subset of the frame including the dtypes in
includeand excluding the dtypes inexclude.- Return type:
- Raises:
ValueError –
If both of
includeandexcludeare empty * Ifincludeandexcludehave overlapping elements * If any kind of string dtype is passed in.
See also
DataFrame.dtypesReturn Series with the data type of each column.
Notes
To select all numeric types, use
np.numberor'number'To select strings you must use the
objectdtype, but note that this will return all object dtype columnsSee the numpy dtype hierarchy
To select datetimes, use
np.datetime64,'datetime'or'datetime64'To select timedeltas, use
np.timedelta64,'timedelta'or'timedelta64'To select Pandas categorical dtypes, use
'category'To select Pandas datetimetz dtypes, use
'datetimetz'or'datetime64[ns, tz]'
Examples
>>> df = pd.DataFrame({'a': [1, 2] * 3, ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df a b c 0 1 True 1.0 1 2 False 2.0 2 1 True 1.0 3 2 False 2.0 4 1 True 1.0 5 2 False 2.0
>>> df.select_dtypes(include='bool') b 0 True 1 False 2 True 3 False 4 True 5 False
>>> df.select_dtypes(include=['float64']) c 0 1.0 1 2.0 2 1.0 3 2.0 4 1.0 5 2.0
>>> df.select_dtypes(exclude=['int64']) b c 0 True 1.0 1 False 2.0 2 True 1.0 3 False 2.0 4 True 1.0 5 False 2.0
- sem(axis: Axis | None = 0, skipna: bool = True, ddof: int = 1, numeric_only: bool = False, **kwargs)
Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
- Parameters:
axis ({index (0), columns (1)}) –
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sem with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.sem().round(6) 0.57735
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra']) >>> df a b tiger 1 2 zebra 2 3 >>> df.sem() a 0.5 b 0.5 dtype: float64
Using axis=1
>>> df.sem(axis=1) tiger 0.5 zebra 0.5 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, ... index=['tiger', 'zebra']) >>> df.sem(numeric_only=True) a 0.5 dtype: float64
- Return type:
- set_axis(labels, *, axis: Axis = 0, copy: bool | None = None) DataFrame
Assign desired index to given axis.
Indexes for column or row labels can be changed by assigning a list-like or Index.
- Parameters:
labels (list-like, Index) – The values for the new index.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to update. The value 0 identifies the rows. For Series this parameter is unused and defaults to 0.
copy (bool, default True) –
Whether to make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
An object of type DataFrame.
- Return type:
See also
DataFrame.rename_axisAlter the name of the index or columns. Examples ——– >>> df = pd.DataFrame({“A”: [1, 2, 3], “B”: [4, 5, 6]}) Change the row labels. >>> df.set_axis([‘a’, ‘b’, ‘c’], axis=’index’) A B a 1 4 b 2 5 c 3 6 Change the column labels. >>> df.set_axis([‘I’, ‘II’], axis=’columns’) I II 0 1 4 1 2 5 2 3 6
- set_index(keys, *, drop: bool = True, append: bool = False, inplace: Literal[False] = False, verify_integrity: bool = False) DataFrame
- set_index(keys, *, drop: bool = True, append: bool = False, inplace: Literal[True], verify_integrity: bool = False) None
Set the DataFrame index using existing columns.
Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.
- Parameters:
keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses
Series,Index,np.ndarray, and instances ofIterator.drop (bool, default True) – Delete columns to be used as the new index.
append (bool, default False) – Whether to append columns to existing index.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
verify_integrity (bool, default False) – Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
- Returns:
Changed row labels or None if
inplace=True.- Return type:
DataFrame or None
See also
DataFrame.reset_indexOpposite of set_index.
DataFrame.reindexChange to new indices or expand indices.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
>>> df = pd.DataFrame({'month': [1, 4, 7, 10], ... 'year': [2012, 2014, 2013, 2014], ... 'sale': [55, 40, 84, 31]}) >>> df month year sale 0 1 2012 55 1 4 2014 40 2 7 2013 84 3 10 2014 31
Set the index to become the ‘month’ column:
>>> df.set_index('month') year sale month 1 2012 55 4 2014 40 7 2013 84 10 2014 31
Create a MultiIndex using columns ‘year’ and ‘month’:
>>> df.set_index(['year', 'month']) sale year month 2012 1 55 2014 4 40 2013 7 84 2014 10 31
Create a MultiIndex using an Index and a column:
>>> df.set_index([pd.Index([1, 2, 3, 4]), 'year']) month sale year 1 2012 1 55 2 2014 4 40 3 2013 7 84 4 2014 10 31
Create a MultiIndex using two Series:
>>> s = pd.Series([1, 2, 3, 4]) >>> df.set_index([s, s**2]) month year sale 1 1 1 2012 55 2 4 4 2014 40 3 9 7 2013 84 4 16 10 2014 31
- property shape: tuple[int, int]
Return a tuple representing the dimensionality of the DataFrame.
See also
ndarray.shapeTuple of array dimensions.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) >>> df.shape (2, 2)
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], ... 'col3': [5, 6]}) >>> df.shape (2, 3)
- shift(periods: int | Sequence[int] = 1, freq: Frequency | None = None, axis: Axis = 0, fill_value: Hashable = <no_default>, suffix: str | None = None) DataFrame
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.
- Parameters:
periods (int or Sequence) – Number of periods to shift. Can be positive or negative. If an iterable of ints, the data will be shifted once by each int. This is equivalent to shifting by one value at a time and concatenating all resulting frames. The resulting columns will have the shift suffixed to their column names. For multiple periods, axis must not be 1.
freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For Series this parameter is unused and defaults to 0.
fill_value (object, optional) – The scalar value to use for newly introduced missing values. the default depends on the dtype of self. For numeric data,
np.nanis used. For datetime, timedelta, or period data, etc.NaTis used. For extension dtypes,self.dtype.na_valueis used.suffix (str, optional) – If str and periods is an iterable, this is added after the column name and before the shift value for each shifted column name.
- Returns:
Copy of input object, shifted.
- Return type:
See also
Index.shiftShift values of Index.
DatetimeIndex.shiftShift values of DatetimeIndex.
PeriodIndex.shiftShift values of PeriodIndex.
Examples
>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45], ... "Col2": [13, 23, 18, 33, 48], ... "Col3": [17, 27, 22, 37, 52]}, ... index=pd.date_range("2020-01-01", "2020-01-05")) >>> df Col1 Col2 Col3 2020-01-01 10 13 17 2020-01-02 20 23 27 2020-01-03 15 18 22 2020-01-04 30 33 37 2020-01-05 45 48 52
>>> df.shift(periods=3) Col1 Col2 Col3 2020-01-01 NaN NaN NaN 2020-01-02 NaN NaN NaN 2020-01-03 NaN NaN NaN 2020-01-04 10.0 13.0 17.0 2020-01-05 20.0 23.0 27.0
>>> df.shift(periods=1, axis="columns") Col1 Col2 Col3 2020-01-01 NaN 10 13 2020-01-02 NaN 20 23 2020-01-03 NaN 15 18 2020-01-04 NaN 30 33 2020-01-05 NaN 45 48
>>> df.shift(periods=3, fill_value=0) Col1 Col2 Col3 2020-01-01 0 0 0 2020-01-02 0 0 0 2020-01-03 0 0 0 2020-01-04 10 13 17 2020-01-05 20 23 27
>>> df.shift(periods=3, freq="D") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
>>> df.shift(periods=3, freq="infer") Col1 Col2 Col3 2020-01-04 10 13 17 2020-01-05 20 23 27 2020-01-06 15 18 22 2020-01-07 30 33 37 2020-01-08 45 48 52
>>> df['Col1'].shift(periods=[0, 1, 2]) Col1_0 Col1_1 Col1_2 2020-01-01 10 NaN NaN 2020-01-02 20 10.0 NaN 2020-01-03 15 20.0 10.0 2020-01-04 30 15.0 20.0 2020-01-05 45 30.0 15.0
- skew(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return unbiased skew over requested axis.
Normalized by N-1.
- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.skew() 0.0
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [1, 3, 5]}, ... index=['tiger', 'zebra', 'cow']) >>> df a b c tiger 1 2 1 zebra 2 3 3 cow 3 4 5 >>> df.skew() a 0.0 b 0.0 c 0.0 dtype: float64
Using axis=1
>>> df.skew(axis=1) tiger 1.732051 zebra -1.732051 cow 0.000000 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['T', 'Z', 'X']}, ... index=['tiger', 'zebra', 'cow']) >>> df.skew(numeric_only=True) a 0.0 dtype: float64
- Return type:
Series or scalar
- sort_index(*, axis: Axis = 0, level: IndexLabel = None, ascending: bool | Sequence[bool] = True, inplace: Literal[True], kind: SortKind = 'quicksort', na_position: NaPosition = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: IndexKeyFunc = None) None
- sort_index(*, axis: Axis = 0, level: IndexLabel = None, ascending: bool | Sequence[bool] = True, inplace: Literal[False] = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: IndexKeyFunc = None) DataFrame
- sort_index(*, axis: Axis = 0, level: IndexLabel = None, ascending: bool | Sequence[bool] = True, inplace: bool = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: IndexKeyFunc = None) DataFrame | None
Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace argument is
False, otherwise updates the original DataFrame and returns None.- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort()for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) – If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin
sorted()function, with the notable difference that this key function should be vectorized. It should expect anIndexand return anIndexof the same shape. For MultiIndex inputs, the key is applied per level.
- Returns:
The original DataFrame sorted by the labels or None if
inplace=True.- Return type:
DataFrame or None
See also
Series.sort_indexSort Series by the index.
DataFrame.sort_valuesSort DataFrame by the value.
Series.sort_valuesSort Series by the value.
Examples
>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], ... columns=['A']) >>> df.sort_index() A 1 4 29 2 100 1 150 5 234 3
By default, it sorts in ascending order, to sort in descending order, use
ascending=False>>> df.sort_index(ascending=False) A 234 3 150 5 100 1 29 2 1 4
A key function can be specified which is applied to the index before sorting. For a
MultiIndexthis is applied to each level separately.>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd']) >>> df.sort_index(key=lambda x: x.str.lower()) a A 1 b 2 C 3 d 4
- sort_values(by: IndexLabel, *, axis: Axis = 0, ascending=True, inplace: Literal[False] = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', ignore_index: bool = False, key: ValueKeyFunc = None) DataFrame
- sort_values(by: IndexLabel, *, axis: Axis = 0, ascending=True, inplace: Literal[True], kind: SortKind = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: ValueKeyFunc = None) None
Sort by the values along either axis.
- Parameters:
by (str or list of str) –
Name or list of names to sort by.
if axis is 0 or ‘index’ then by may contain index levels and/or column labels.
if axis is 1 or ‘columns’ then by may contain column levels and/or index labels.
axis ("{0 or 'index', 1 or 'columns'}", default 0) – Axis to be sorted.
ascending (bool or list of bool, default True) – Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort()for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) – Apply the key function to the values before sorting. This is similar to the key argument in the builtin
sorted()function, with the notable difference that this key function should be vectorized. It should expect aSeriesand return a Series with the same shape as the input. It will be applied to each column in by independently.
- Returns:
DataFrame with sorted values or None if
inplace=True.- Return type:
DataFrame or None
See also
DataFrame.sort_indexSort a DataFrame by the index.
Series.sort_valuesSimilar method for a Series.
Examples
>>> df = pd.DataFrame({ ... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'], ... 'col2': [2, 1, 9, 8, 7, 4], ... 'col3': [0, 1, 9, 4, 2, 3], ... 'col4': ['a', 'B', 'c', 'D', 'e', 'F'] ... }) >>> df col1 col2 col3 col4 0 A 2 0 a 1 A 1 1 B 2 B 9 9 c 3 NaN 8 4 D 4 D 7 2 e 5 C 4 3 F
Sort by col1
>>> df.sort_values(by=['col1']) col1 col2 col3 col4 0 A 2 0 a 1 A 1 1 B 2 B 9 9 c 5 C 4 3 F 4 D 7 2 e 3 NaN 8 4 D
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2']) col1 col2 col3 col4 1 A 1 1 B 0 A 2 0 a 2 B 9 9 c 5 C 4 3 F 4 D 7 2 e 3 NaN 8 4 D
Sort Descending
>>> df.sort_values(by='col1', ascending=False) col1 col2 col3 col4 4 D 7 2 e 5 C 4 3 F 2 B 9 9 c 0 A 2 0 a 1 A 1 1 B 3 NaN 8 4 D
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first') col1 col2 col3 col4 3 NaN 8 4 D 4 D 7 2 e 5 C 4 3 F 2 B 9 9 c 0 A 2 0 a 1 A 1 1 B
Sorting with a key function
>>> df.sort_values(by='col4', key=lambda col: col.str.lower()) col1 col2 col3 col4 0 A 2 0 a 1 A 1 1 B 2 B 9 9 c 3 NaN 8 4 D 4 D 7 2 e 5 C 4 3 F
Natural sort with the key argument, using the natsort <https://github.com/SethMMorton/natsort> package.
>>> df = pd.DataFrame({ ... "time": ['0hr', '128hr', '72hr', '48hr', '96hr'], ... "value": [10, 20, 30, 40, 50] ... }) >>> df time value 0 0hr 10 1 128hr 20 2 72hr 30 3 48hr 40 4 96hr 50 >>> from natsort import index_natsorted >>> df.sort_values( ... by="time", ... key=lambda x: np.argsort(index_natsorted(df["time"])) ... ) time value 0 0hr 10 3 48hr 40 2 72hr 30 4 96hr 50 1 128hr 20
- sparse
alias of
SparseFrameAccessor
- stack(level: IndexLabel = -1, dropna: bool | lib.NoDefault = <no_default>, sort: bool | lib.NoDefault = <no_default>, future_stack: bool = False)
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:
if the columns have a single level, the output is a Series;
if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.
- Parameters:
level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.
dropna (bool, default True) – Whether to drop rows in the resulting Frame/Series with missing values. Stacking a column level onto the index axis can create combinations of index and column values that are missing from the original dataframe. See Examples section.
sort (bool, default True) – Whether to sort the levels of the resulting MultiIndex.
future_stack (bool, default False) – Whether to use the new implementation that will replace the current implementation in pandas 3.0. When True, dropna and sort have no impact on the result and must remain unspecified. See pandas 2.1.0 Release notes for more details.
- Returns:
Stacked dataframe or series.
- Return type:
See also
DataFrame.unstackUnstack prescribed level(s) from index axis onto column axis.
DataFrame.pivotReshape dataframe from long format to wide format.
DataFrame.pivot_tableCreate a spreadsheet-style pivot table as a DataFrame.
Notes
The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).
Reference the user guide for more examples.
Examples
Single level columns
>>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]], ... index=['cat', 'dog'], ... columns=['weight', 'height'])
Stacking a dataframe with a single level column axis returns a Series:
>>> df_single_level_cols weight height cat 0 1 dog 2 3 >>> df_single_level_cols.stack(future_stack=True) cat weight 0 height 1 dog weight 2 height 3 dtype: int64
Multi level columns: simple case
>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('weight', 'pounds')]) >>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]], ... index=['cat', 'dog'], ... columns=multicol1)
Stacking a dataframe with a multi-level column axis:
>>> df_multi_level_cols1 weight kg pounds cat 1 2 dog 2 4 >>> df_multi_level_cols1.stack(future_stack=True) weight cat kg 1 pounds 2 dog kg 2 pounds 4
Missing values
>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'), ... ('height', 'm')]) >>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], ... index=['cat', 'dog'], ... columns=multicol2)
It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NaNs:
>>> df_multi_level_cols2 weight height kg m cat 1.0 2.0 dog 3.0 4.0 >>> df_multi_level_cols2.stack(future_stack=True) weight height cat kg 1.0 NaN m NaN 2.0 dog kg 3.0 NaN m NaN 4.0
Prescribing the level(s) to be stacked
The first parameter controls which level or levels are stacked:
>>> df_multi_level_cols2.stack(0, future_stack=True) kg m cat weight 1.0 NaN height NaN 2.0 dog weight 3.0 NaN height NaN 4.0 >>> df_multi_level_cols2.stack([0, 1], future_stack=True) cat weight kg 1.0 height m 2.0 dog weight kg 3.0 height m 4.0 dtype: float64
- std(axis: Axis | None = 0, skipna: bool = True, ddof: int = 1, numeric_only: bool = False, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0), columns (1)}) –
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.std with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
The standard deviation of the columns can be found as follows:
>>> df.std() age 18.786076 height 0.237417 dtype: float64
Alternatively, ddof=0 can be set to normalize by N instead of N-1:
>>> df.std(ddof=0) age 16.269219 height 0.205609 dtype: float64
- property style: Styler
Returns a Styler object.
Contains methods for building a styled HTML representation of the DataFrame.
See also
io.formats.style.StylerHelps style a DataFrame or Series according to the data with HTML and CSS.
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3]}) >>> df.style
Please see Table Visualization for more examples.
- sub(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to
dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- subtract(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to
dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- sum(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, min_count: int = 0, **kwargs)
Return the sum of the values over the requested axis.
This is equivalent to the method
numpy.sum.- Parameters:
axis ({index (0), columns (1)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sum with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_countnon-NA values are present the result will be NA.**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
Series or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.sum() 14
By default, the sum of an empty or all-NA Series is
0.>>> pd.Series([], dtype="float64").sum() # min_count=0 is the default 0.0
This can be controlled with the
min_countparameter. For example, if you’d like the sum of an empty series to be NaN, passmin_count=1.>>> pd.Series([], dtype="float64").sum(min_count=1) nan
Thanks to the
skipnaparameter,min_counthandles all-NA and empty series identically.>>> pd.Series([np.nan]).sum() 0.0
>>> pd.Series([np.nan]).sum(min_count=1) nan
- swaplevel(i: Axis = -2, j: Axis = -1, axis: Axis = 0) DataFrame
Swap levels i and j in a
MultiIndex.Default is to swap the two innermost levels of the index.
- Parameters:
i (int or str) – Levels of the indices to be swapped. Can pass level name as string.
j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- Returns:
DataFrame with levels swapped in MultiIndex.
- Return type:
Examples
>>> df = pd.DataFrame( ... {"Grade": ["A", "B", "A", "C"]}, ... index=[ ... ["Final exam", "Final exam", "Coursework", "Coursework"], ... ["History", "Geography", "History", "Geography"], ... ["January", "February", "March", "April"], ... ], ... ) >>> df Grade Final exam History January A Geography February B Coursework History March A Geography April C
In the following example, we will swap the levels of the indices. Here, we will swap the levels column-wise, but levels can be swapped row-wise in a similar manner. Note that column-wise is the default behaviour. By not supplying any arguments for i and j, we swap the last and second to last indices.
>>> df.swaplevel() Grade Final exam January History A February Geography B Coursework March History A April Geography C
By supplying one argument, we can choose which index to swap the last index with. We can for example swap the first index with the last one as follows.
>>> df.swaplevel(0) Grade January History Final exam A February Geography Final exam B March History Coursework A April Geography Coursework C
We can also define explicitly which indices we want to swap by supplying values for both i and j. Here, we for example swap the first and second indices.
>>> df.swaplevel(0, 1) Grade History Final exam January A Geography Final exam February B History Coursework March A Geography Coursework April C
- to_dict(orient: Literal['dict', 'list', 'series', 'split', 'tight', 'index'] = 'dict', *, into: type[MutableMappingT] | MutableMappingT, index: bool = True) MutableMappingT
- to_dict(orient: Literal['records'], *, into: type[MutableMappingT] | MutableMappingT, index: bool = True) list[MutableMappingT]
- to_dict(orient: Literal['dict', 'list', 'series', 'split', 'tight', 'index'] = 'dict', *, into: type[dict] = <class 'dict'>, index: bool = True) dict
- to_dict(orient: Literal['records'], *, into: type[dict] = <class 'dict'>, index: bool = True) list[dict]
Convert the DataFrame to a dictionary.
The type of the key-value pairs can be customized with the parameters (see below).
- Parameters:
orient (str {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}) –
Determines the type of the values of the dictionary.
’dict’ (default) : dict like {column -> {index -> value}}
’list’ : dict like {column -> [values]}
’series’ : dict like {column -> Series(values)}
’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
’tight’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values], ‘index_names’ -> [index.names], ‘column_names’ -> [column.names]}
’records’ : list like [{column -> value}, … , {column -> value}]
’index’ : dict like {index -> {column -> value}}
Added in version 1.4.0: ‘tight’ as an allowed value for the
orientargumentinto (class, default dict) – The collections.abc.MutableMapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
index (bool, default True) –
Whether to include the index item (and index_names item if orient is ‘tight’) in the returned dictionary. Can only be
Falsewhen orient is ‘split’ or ‘tight’.Added in version 2.0.0.
- Returns:
Return a collections.abc.MutableMapping object representing the DataFrame. The resulting transformation depends on the orient parameter.
- Return type:
dict, list or collections.abc.MutableMapping
See also
DataFrame.from_dictCreate a DataFrame from a dictionary.
DataFrame.to_jsonConvert a DataFrame to JSON format.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], ... 'col2': [0.5, 0.75]}, ... index=['row1', 'row2']) >>> df col1 col2 row1 1 0.50 row2 2 0.75 >>> df.to_dict() {'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can specify the return orientation.
>>> df.to_dict('series') {'col1': row1 1 row2 2 Name: col1, dtype: int64, 'col2': row1 0.50 row2 0.75 Name: col2, dtype: float64}
>>> df.to_dict('split') {'index': ['row1', 'row2'], 'columns': ['col1', 'col2'], 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records') [{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index') {'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
>>> df.to_dict('tight') {'index': ['row1', 'row2'], 'columns': ['col1', 'col2'], 'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}
You can also specify the mapping type.
>>> from collections import OrderedDict, defaultdict >>> df.to_dict(into=OrderedDict) OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])), ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
If you want a defaultdict, you need to initialize it:
>>> dd = defaultdict(list) >>> df.to_dict('records', into=dd) [defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}), defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
- to_feather(path: FilePath | WriteBuffer[bytes], **kwargs) None
Write a DataFrame to the binary Feather format.
- Parameters:
path (str, path object, file-like object) – String, path object (implementing
os.PathLike[str]), or file-like object implementing a binarywrite()function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.**kwargs – Additional keywords passed to
pyarrow.feather.write_feather(). This includes the compression, compression_level, chunksize and version keywords.
Notes
This function writes the dataframe as a feather file. Requires a default index. For saving the DataFrame with your custom index use a method that supports custom indices e.g. to_parquet.
Examples
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]]) >>> df.to_feather("file.feather")
- to_gbq(destination_table: str, *, project_id: str | None = None, chunksize: int | None = None, reauth: bool = False, if_exists: ToGbqIfexist = 'fail', auth_local_webserver: bool = True, table_schema: list[dict[str, str]] | None = None, location: str | None = None, progress_bar: bool = True, credentials=None) None
Write a DataFrame to a Google BigQuery table.
Deprecated since version 2.2.0: Please use
pandas_gbq.to_gbqinstead.This function requires the pandas-gbq package.
See the How to authenticate with Google BigQuery guide for authentication instructions.
- Parameters:
destination_table (str) – Name of table to be written, in the form
dataset.tablename.project_id (str, optional) – Google BigQuery Account project ID. Optional when available from the environment.
chunksize (int, optional) – Number of rows to be inserted in each chunk from the dataframe. Set to
Noneto load the whole dataframe at once.reauth (bool, default False) – Force Google BigQuery to re-authenticate the user. This is useful if multiple accounts are used.
if_exists (str, default 'fail') –
Behavior when the destination table exists. Value can be one of:
'fail'If table exists raise pandas_gbq.gbq.TableCreationError.
'replace'If table exists, drop it, recreate it, and insert data.
'append'If table exists, insert data. Create if does not exist.
auth_local_webserver (bool, default True) –
Use the local webserver flow instead of the console flow when getting user credentials.
New in version 0.2.0 of pandas-gbq.
Changed in version 1.5.0: Default value is changed to
True. Google has deprecated theauth_local_webserver = False“out of band” (copy-paste) flow.table_schema (list of dicts, optional) –
List of BigQuery table fields to which according DataFrame columns conform to, e.g.
[{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field.New in version 0.3.1 of pandas-gbq.
location (str, optional) –
Location where the load job should run. See the BigQuery locations documentation for a list of available locations. The location must match that of the target dataset.
New in version 0.5.0 of pandas-gbq.
progress_bar (bool, default True) –
Use the library tqdm to show the progress bar for the upload, chunk by chunk.
New in version 0.5.0 of pandas-gbq.
credentials (google.auth.credentials.Credentials, optional) –
Credentials for accessing Google APIs. Use this parameter to override default credentials, such as to use Compute Engine
google.auth.compute_engine.Credentialsor Service Accountgoogle.oauth2.service_account.Credentialsdirectly.New in version 0.8.0 of pandas-gbq.
See also
pandas_gbq.to_gbqThis function in the pandas-gbq library.
read_gbqRead a DataFrame from Google BigQuery.
Examples
Example taken from Google BigQuery documentation
>>> project_id = "my-project" >>> table_id = 'my_dataset.my_table' >>> df = pd.DataFrame({ ... "my_string": ["a", "b", "c"], ... "my_int64": [1, 2, 3], ... "my_float64": [4.0, 5.0, 6.0], ... "my_bool1": [True, False, True], ... "my_bool2": [False, True, False], ... "my_dates": pd.date_range("now", periods=3), ... } ... )
>>> df.to_gbq(table_id, project_id=project_id)
- to_html(buf: FilePath | WriteBuffer[str], columns: Axes | None = None, col_space: ColspaceArgType | None = None, header: bool = True, index: bool = True, na_rep: str = 'NaN', formatters: FormattersType | None = None, float_format: FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool | str = False, decimal: str = '.', bold_rows: bool = True, classes: str | list | tuple | None = None, escape: bool = True, notebook: bool = False, border: int | bool | None = None, table_id: str | None = None, render_links: bool = False, encoding: str | None = None) None
- to_html(buf: None = None, columns: Axes | None = None, col_space: ColspaceArgType | None = None, header: bool = True, index: bool = True, na_rep: str = 'NaN', formatters: FormattersType | None = None, float_format: FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool | str = False, decimal: str = '.', bold_rows: bool = True, classes: str | list | tuple | None = None, escape: bool = True, notebook: bool = False, border: int | bool | None = None, table_id: str | None = None, render_links: bool = False, encoding: str | None = None) str
Render a DataFrame as an HTML table.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (array-like, optional, default None) – The subset of columns to write. Writes all columns by default.
col_space (str or int, list or dict of int or str, optional) – The minimum width of each column in CSS length units. An int is assumed to be px units..
header (bool, optional) – Whether to print column labels, default True.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of
NaNto use.formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
float_format (one-parameter function, optional, default None) – Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-
NaNelements, withNaNbeing handled byna_rep.sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) –
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
bold_rows (bool, default True) – Make the row labels bold in the output.
classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.
escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.
notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.
border (int) – A
border=borderattribute is included in the opening <table> tag. Defaultpd.options.display.html.border.table_id (str, optional) – A css id is included in the opening <table> tag if specified.
render_links (bool, default False) – Convert URLs to HTML links.
encoding (str, default "utf-8") – Set character encoding.
- Returns:
If buf is None, returns the result as a string. Otherwise returns None.
- Return type:
str or None
See also
to_stringConvert DataFrame to a string.
Examples
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]}) >>> html_string = '''<table border="1" class="dataframe"> ... <thead> ... <tr style="text-align: right;"> ... <th></th> ... <th>col1</th> ... <th>col2</th> ... </tr> ... </thead> ... <tbody> ... <tr> ... <th>0</th> ... <td>1</td> ... <td>4</td> ... </tr> ... <tr> ... <th>1</th> ... <td>2</td> ... <td>3</td> ... </tr> ... </tbody> ... </table>''' >>> assert html_string == df.to_html()
- to_markdown(buf: FilePath | WriteBuffer[str] | None = None, *, mode: str = 'wt', index: bool = True, storage_options: StorageOptions | None = None, **kwargs) str | None
Print DataFrame in Markdown-friendly format.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) – Add index (row) labels.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib.request.Requestas header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded tofsspec.open. Please seefsspecandurllibfor more details, and for more examples on storage options refer here.**kwargs – These parameters will be passed to tabulate.
- Returns:
DataFrame in Markdown-friendly format.
- Return type:
str
Notes
Requires the tabulate package.
- Examples
>>> df = pd.DataFrame( ... data={"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]} ... ) >>> print(df.to_markdown()) | | animal_1 | animal_2 | |---:|:-----------|:-----------| | 0 | elk | dog | | 1 | pig | quetzal |
Output markdown with a tabulate option.
>>> print(df.to_markdown(tablefmt="grid")) +----+------------+------------+ | | animal_1 | animal_2 | +====+============+============+ | 0 | elk | dog | +----+------------+------------+ | 1 | pig | quetzal | +----+------------+------------+
- to_numpy(dtype: npt.DTypeLike | None = None, copy: bool = False, na_value: object = <no_default>) np.ndarray
Convert the DataFrame to a NumPy array.
By default, the dtype of the returned array will be the common NumPy dtype of all types in the DataFrame. For example, if the dtypes are
float16andfloat32, the results dtype will befloat32. This may require copying data and coercing values, which may be expensive.- Parameters:
dtype (str or numpy.dtype, optional) – The dtype to pass to
numpy.asarray().copy (bool, default False) – Whether to ensure that the returned value is not a view on another array. Note that
copy=Falsedoes not ensure thatto_numpy()is no-copy. Rather,copy=Trueensure that a copy is made, even if not strictly necessary.na_value (Any, optional) – The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
- Return type:
See also
Series.to_numpySimilar method for Series.
Examples
>>> pd.DataFrame({"A": [1, 2], "B": [3, 4]}).to_numpy() array([[1, 3], [2, 4]])
With heterogeneous data, the lowest common type will have to be used.
>>> df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.5]}) >>> df.to_numpy() array([[1. , 3. ], [2. , 4.5]])
For a mix of numeric and non-numeric types, the output array will have object dtype.
>>> df['C'] = pd.date_range('2000', periods=2) >>> df.to_numpy() array([[1, 3.0, Timestamp('2000-01-01 00:00:00')], [2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
- to_orc(path: FilePath | WriteBuffer[bytes] | None = None, *, engine: Literal['pyarrow'] = 'pyarrow', index: bool | None = None, engine_kwargs: dict[str, Any] | None = None) bytes | None
Write a DataFrame to the ORC format.
Added in version 1.5.0.
- Parameters:
path (str, file-like object or None, default None) – If a string, it will be used as Root Directory path when writing a partitioned dataset. By file-like object, we refer to objects with a write() method, such as a file handle (e.g. via builtin open function). If path is None, a bytes object is returned.
engine ({'pyarrow'}, default 'pyarrow') – ORC library to use.
index (bool, optional) – If
True, include the dataframe’s index(es) in the file output. IfFalse, they will not be written to the file. IfNone, similar toinferthe dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.engine_kwargs (dict[str, Any] or None, default None) – Additional keyword arguments passed to
pyarrow.orc.write_table().
- Return type:
bytes if no path argument is provided else None
- Raises:
NotImplementedError – Dtype of one or more columns is category, unsigned integers, interval, period or sparse.
ValueError – engine is not pyarrow.
See also
read_orcRead a ORC file.
DataFrame.to_parquetWrite a parquet file.
DataFrame.to_csvWrite a csv file.
DataFrame.to_sqlWrite to a sql table.
DataFrame.to_hdfWrite to hdf.
Notes
Before using this function you should read the user guide about ORC and install optional dependencies.
This function requires pyarrow library.
For supported dtypes please refer to supported ORC features in Arrow.
Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.
Examples
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]}) >>> df.to_orc('df.orc') >>> pd.read_orc('df.orc') col1 col2 0 1 4 1 2 3
If you want to get a buffer to the orc content you can write it to io.BytesIO
>>> import io >>> b = io.BytesIO(df.to_orc()) >>> b.seek(0) 0 >>> content = b.read()
- to_parquet(path: None = None, engine: Literal['auto', 'pyarrow', 'fastparquet'] = 'auto', compression: str | None = 'snappy', index: bool | None = None, partition_cols: list[str] | None = None, storage_options: StorageOptions = None, **kwargs) bytes
- to_parquet(path: FilePath | WriteBuffer[bytes], engine: Literal['auto', 'pyarrow', 'fastparquet'] = 'auto', compression: str | None = 'snappy', index: bool | None = None, partition_cols: list[str] | None = None, storage_options: StorageOptions = None, **kwargs) None
Write a DataFrame to the binary parquet format.
This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.
- Parameters:
path (str, path object, file-like object, or None, default None) – String, path object (implementing
os.PathLike[str]), or file-like object implementing a binarywrite()function. If None, the result is returned as bytes. If a string or path, it will be used as Root Directory path when writing a partitioned dataset.engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option
io.parquet.engineis used. The defaultio.parquet.enginebehavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.compression (str or None, default 'snappy') – Name of the compression to use. Use
Nonefor no compression. Supported options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.index (bool, default None) – If
True, include the dataframe’s index(es) in the file output. IfFalse, they will not be written to the file. IfNone, similar toTruethe dataframe’s index(es) will be saved. However, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib.request.Requestas header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded tofsspec.open. Please seefsspecandurllibfor more details, and for more examples on storage options refer here.**kwargs – Additional arguments passed to the parquet library. See pandas io for more details.
- Return type:
bytes if no path argument is provided else None
See also
read_parquetRead a parquet file.
DataFrame.to_orcWrite an orc file.
DataFrame.to_csvWrite a csv file.
DataFrame.to_sqlWrite to a sql table.
DataFrame.to_hdfWrite to hdf.
Notes
This function requires either the fastparquet or pyarrow library.
Examples
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}) >>> df.to_parquet('df.parquet.gzip', ... compression='gzip') >>> pd.read_parquet('df.parquet.gzip') col1 col2 0 1 3 1 2 4
If you want to get a buffer to the parquet content you can use a io.BytesIO object, as long as you don’t use partition_cols, which creates multiple files.
>>> import io >>> f = io.BytesIO() >>> df.to_parquet(f) >>> f.seek(0) 0 >>> content = f.read()
- to_period(freq: Frequency | None = None, axis: Axis = 0, copy: bool | None = None) DataFrame
Convert DataFrame from DatetimeIndex to PeriodIndex.
Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency (inferred from index if not passed).
- Parameters:
freq (str, default) – Frequency of the PeriodIndex.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).
copy (bool, default True) –
If False then underlying input data is not copied.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
The DataFrame has a PeriodIndex.
- Return type:
Examples
>>> idx = pd.to_datetime( ... [ ... "2001-03-31 00:00:00", ... "2002-05-31 00:00:00", ... "2003-08-31 00:00:00", ... ] ... )
>>> idx DatetimeIndex(['2001-03-31', '2002-05-31', '2003-08-31'], dtype='datetime64[ns]', freq=None)
>>> idx.to_period("M") PeriodIndex(['2001-03', '2002-05', '2003-08'], dtype='period[M]')
For the yearly frequency
>>> idx.to_period("Y") PeriodIndex(['2001', '2002', '2003'], dtype='period[Y-DEC]')
- to_records(index: bool = True, column_dtypes=None, index_dtypes=None) recarray
Convert DataFrame to a NumPy record array.
Index will be included as the first field of the record array if requested.
- Parameters:
index (bool, default True) – Include index in resulting record array, stored in ‘index’ field or using the index label, if set.
column_dtypes (str, type, dict, default None) – If a string or type, the data type to store all columns. If a dictionary, a mapping of column names and indices (zero-indexed) to specific data types.
index_dtypes (str, type, dict, default None) –
If a string or type, the data type to store all index levels. If a dictionary, a mapping of index level names and indices (zero-indexed) to specific data types.
This mapping is applied only if index=True.
- Returns:
NumPy ndarray with the DataFrame labels as fields and each row of the DataFrame as entries.
- Return type:
numpy.rec.recarray
See also
DataFrame.from_recordsConvert structured or record ndarray to DataFrame.
numpy.rec.recarrayAn ndarray that allows field access using attributes, analogous to typed columns in a spreadsheet.
Examples
>>> df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 0.75]}, ... index=['a', 'b']) >>> df A B a 1 0.50 b 2 0.75 >>> df.to_records() rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)], dtype=[('index', 'O'), ('A', '<i8'), ('B', '<f8')])
If the DataFrame index has no label then the recarray field name is set to ‘index’. If the index has a label then this is used as the field name:
>>> df.index = df.index.rename("I") >>> df.to_records() rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)], dtype=[('I', 'O'), ('A', '<i8'), ('B', '<f8')])
The index can be excluded from the record array:
>>> df.to_records(index=False) rec.array([(1, 0.5 ), (2, 0.75)], dtype=[('A', '<i8'), ('B', '<f8')])
Data types can be specified for the columns:
>>> df.to_records(column_dtypes={"A": "int32"}) rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)], dtype=[('I', 'O'), ('A', '<i4'), ('B', '<f8')])
As well as for the index:
>>> df.to_records(index_dtypes="<S2") rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)], dtype=[('I', 'S2'), ('A', '<i8'), ('B', '<f8')])
>>> index_dtypes = f"<S{df.index.str.len().max()}" >>> df.to_records(index_dtypes=index_dtypes) rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)], dtype=[('I', 'S1'), ('A', '<i8'), ('B', '<f8')])
- to_stata(path: FilePath | WriteBuffer[bytes], *, convert_dates: dict[Hashable, str] | None = None, write_index: bool = True, byteorder: ToStataByteorder | None = None, time_stamp: datetime.datetime | None = None, data_label: str | None = None, variable_labels: dict[Hashable, str] | None = None, version: int | None = 114, convert_strl: Sequence[Hashable] | None = None, compression: CompressionOptions = 'infer', storage_options: StorageOptions | None = None, value_labels: dict[Hashable, dict[float, str]] | None = None) None
Export DataFrame object to Stata dta format.
Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.
- Parameters:
path (str, path object, or buffer) – String, path object (implementing
os.PathLike[str]), or file-like object implementing a binarywrite()function.convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.
write_index (bool) – Write the index to Stata dataset.
byteorder (str) – Can be “>”, “<”, “little”, or “big”. default is sys.byteorder.
time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.
data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.
variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.
version ({114, 117, 118, 119, None}, default 114) –
Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. Version 114 can be read by Stata 10 and later. Version 117 can be read by Stata 13 or later. Version 118 is supported in Stata 14 and later. Version 119 is supported in Stata 15 and later. Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and version 119 supports more than 32,767 variables.
Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read version 119 files.
convert_strl (list, optional) – List of column names to convert to string columns to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.
compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to
Nonefor no compression. Can also be a dict with key'method'set to one of {'zip','gzip','bz2','zstd','xz','tar'} and other key-value pairs are forwarded tozipfile.ZipFile,gzip.GzipFile,bz2.BZ2File,zstandard.ZstdCompressor,lzma.LZMAFileortarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive:compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib.request.Requestas header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded tofsspec.open. Please seefsspecandurllibfor more details, and for more examples on storage options refer here.value_labels (dict of dicts) –
Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.
Added in version 1.4.0.
- Raises:
NotImplementedError –
If datetimes contain timezone information * Column dtype is not representable in Stata
ValueError –
Columns listed in convert_dates are neither datetime64[ns] or datetime.datetime * Column listed in convert_dates is not in DataFrame * Categorical label contains more than 32,000 characters
See also
read_stataImport Stata data files.
io.stata.StataWriterLow-level writer for Stata data files.
io.stata.StataWriter117Low-level writer for version 117 files.
Examples
>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon', ... 'parrot'], ... 'speed': [350, 18, 361, 15]}) >>> df.to_stata('animals.dta')
- to_string(buf: None = None, columns: Axes | None = None, col_space: int | list[int] | dict[Hashable, int] | None = None, header: bool | SequenceNotStr[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: fmt.FormattersType | None = None, float_format: fmt.FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool = False, decimal: str = '.', line_width: int | None = None, min_rows: int | None = None, max_colwidth: int | None = None, encoding: str | None = None) str
- to_string(buf: FilePath | WriteBuffer[str], columns: Axes | None = None, col_space: int | list[int] | dict[Hashable, int] | None = None, header: bool | SequenceNotStr[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: fmt.FormattersType | None = None, float_format: fmt.FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool = False, decimal: str = '.', line_width: int | None = None, min_rows: int | None = None, max_colwidth: int | None = None, encoding: str | None = None) None
Render a DataFrame to a console-friendly tabular output.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (array-like, optional, default None) – The subset of columns to write. Writes all columns by default.
col_space (int, list or dict of int, optional) – The minimum width of each column. If a list of ints is given every integers corresponds with one column. If a dict is given, the key references the column, while the value defines the space to use..
header (bool or list of str, optional) – Write out the column names. If a list of columns is given, it is assumed to be aliases for the column names.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of
NaNto use.formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
float_format (one-parameter function, optional, default None) – Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-
NaNelements, withNaNbeing handled byna_rep.sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) –
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_width (int, optional) – Width to wrap a line in characters.
min_rows (int, optional) – The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).
max_colwidth (int, optional) – Max width to truncate each column in characters. By default, no limit.
encoding (str, default "utf-8") – Set character encoding.
- Returns:
If buf is None, returns the result as a string. Otherwise returns None.
- Return type:
str or None
See also
to_htmlConvert DataFrame to HTML.
Examples
>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]} >>> df = pd.DataFrame(d) >>> print(df.to_string()) col1 col2 0 1 4 1 2 5 2 3 6
- to_timestamp(freq: Frequency | None = None, how: ToTimestampHow = 'start', axis: Axis = 0, copy: bool | None = None) DataFrame
Cast to DatetimeIndex of timestamps, at beginning of period.
- Parameters:
freq (str, default frequency of PeriodIndex) – Desired frequency.
how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).
copy (bool, default True) –
If False then underlying input data is not copied.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
The DataFrame has a DatetimeIndex.
- Return type:
Examples
>>> idx = pd.PeriodIndex(['2023', '2024'], freq='Y') >>> d = {'col1': [1, 2], 'col2': [3, 4]} >>> df1 = pd.DataFrame(data=d, index=idx) >>> df1 col1 col2 2023 1 3 2024 2 4
The resulting timestamps will be at the beginning of the year in this case
>>> df1 = df1.to_timestamp() >>> df1 col1 col2 2023-01-01 1 3 2024-01-01 2 4 >>> df1.index DatetimeIndex(['2023-01-01', '2024-01-01'], dtype='datetime64[ns]', freq=None)
Using freq which is the offset that the Timestamps will have
>>> df2 = pd.DataFrame(data=d, index=idx) >>> df2 = df2.to_timestamp(freq='M') >>> df2 col1 col2 2023-01-31 1 3 2024-01-31 2 4 >>> df2.index DatetimeIndex(['2023-01-31', '2024-01-31'], dtype='datetime64[ns]', freq=None)
- to_xml(path_or_buffer: None = None, *, index: bool = True, root_name: str | None = 'data', row_name: str | None = 'row', na_rep: str | None = None, attr_cols: list[str] | None = None, elem_cols: list[str] | None = None, namespaces: dict[str | None, str] | None = None, prefix: str | None = None, encoding: str = 'utf-8', xml_declaration: bool | None = True, pretty_print: bool | None = True, parser: XMLParsers | None = 'lxml', stylesheet: FilePath | ReadBuffer[str] | ReadBuffer[bytes] | None = None, compression: CompressionOptions = 'infer', storage_options: StorageOptions | None = None) str
- to_xml(path_or_buffer: FilePath | WriteBuffer[bytes] | WriteBuffer[str], *, index: bool = True, root_name: str | None = 'data', row_name: str | None = 'row', na_rep: str | None = None, attr_cols: list[str] | None = None, elem_cols: list[str] | None = None, namespaces: dict[str | None, str] | None = None, prefix: str | None = None, encoding: str = 'utf-8', xml_declaration: bool | None = True, pretty_print: bool | None = True, parser: XMLParsers | None = 'lxml', stylesheet: FilePath | ReadBuffer[str] | ReadBuffer[bytes] | None = None, compression: CompressionOptions = 'infer', storage_options: StorageOptions | None = None) None
Render a DataFrame to an XML document.
Added in version 1.3.0.
- Parameters:
path_or_buffer (str, path object, file-like object, or None, default None) – String, path object (implementing
os.PathLike[str]), or file-like object implementing awrite()function. If None, the result is returned as a string.index (bool, default True) – Whether to include index in XML document.
root_name (str, default 'data') – The name of root element in XML document.
row_name (str, default 'row') – The name of row element in XML document.
na_rep (str, optional) – Missing data representation.
attr_cols (list-like, optional) – List of columns to write as attributes in row element. Hierarchical columns will be flattened with underscore delimiting the different levels.
elem_cols (list-like, optional) – List of columns to write as children in row element. By default, all columns output as children of row element. Hierarchical columns will be flattened with underscore delimiting the different levels.
namespaces (dict, optional) –
All namespaces to be defined in root element. Keys of dict should be prefix names and values of dict corresponding URIs. Default namespaces should be given empty string key. For example,
namespaces = {"": "https://example.com"}
prefix (str, optional) – Namespace prefix to be used for every element and/or attribute in document. This should be one of the keys in
namespacesdict.encoding (str, default 'utf-8') – Encoding of the resulting document.
xml_declaration (bool, default True) – Whether to include the XML declaration at start of document.
pretty_print (bool, default True) – Whether output should be pretty printed with indentation and line breaks.
parser ({'lxml','etree'}, default 'lxml') – Parser module to use for building of tree. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’, the ability to use XSLT stylesheet is supported.
stylesheet (str, path object or file-like object, optional) – A URL, file-like object, or a raw string containing an XSLT script used to transform the raw XML output. Script should use layout of elements and attributes from original output. This argument requires
lxmlto be installed. Only XSLT 1.0 scripts and not later versions is currently supported.compression (str or dict, default 'infer') –
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to
Nonefor no compression. Can also be a dict with key'method'set to one of {'zip','gzip','bz2','zstd','xz','tar'} and other key-value pairs are forwarded tozipfile.ZipFile,gzip.GzipFile,bz2.BZ2File,zstandard.ZstdCompressor,lzma.LZMAFileortarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive:compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib.request.Requestas header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded tofsspec.open. Please seefsspecandurllibfor more details, and for more examples on storage options refer here.
- Returns:
If
iois None, returns the resulting XML format as a string. Otherwise returns None.- Return type:
None or str
See also
to_jsonConvert the pandas object to a JSON string.
to_htmlConvert DataFrame to a html.
Examples
>>> df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'], ... 'degrees': [360, 360, 180], ... 'sides': [4, np.nan, 3]})
>>> df.to_xml() <?xml version='1.0' encoding='utf-8'?> <data> <row> <index>0</index> <shape>square</shape> <degrees>360</degrees> <sides>4.0</sides> </row> <row> <index>1</index> <shape>circle</shape> <degrees>360</degrees> <sides/> </row> <row> <index>2</index> <shape>triangle</shape> <degrees>180</degrees> <sides>3.0</sides> </row> </data>
>>> df.to_xml(attr_cols=[ ... 'index', 'shape', 'degrees', 'sides' ... ]) <?xml version='1.0' encoding='utf-8'?> <data> <row index="0" shape="square" degrees="360" sides="4.0"/> <row index="1" shape="circle" degrees="360"/> <row index="2" shape="triangle" degrees="180" sides="3.0"/> </data>
>>> df.to_xml(namespaces={"doc": "https://example.com"}, ... prefix="doc") <?xml version='1.0' encoding='utf-8'?> <doc:data xmlns:doc="https://example.com"> <doc:row> <doc:index>0</doc:index> <doc:shape>square</doc:shape> <doc:degrees>360</doc:degrees> <doc:sides>4.0</doc:sides> </doc:row> <doc:row> <doc:index>1</doc:index> <doc:shape>circle</doc:shape> <doc:degrees>360</doc:degrees> <doc:sides/> </doc:row> <doc:row> <doc:index>2</doc:index> <doc:shape>triangle</doc:shape> <doc:degrees>180</doc:degrees> <doc:sides>3.0</doc:sides> </doc:row> </doc:data>
- transform(func: AggFuncType, axis: Axis = 0, *args, **kwargs) DataFrame
Call
funcon self producing a DataFrame with the same axis shape as self.- Parameters:
func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g.
[np.exp, 'sqrt']dict-like of axis labels -> functions, function names or list-like of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
A DataFrame that must have the same length as self.
- Return type:
:raises ValueError : If the returned DataFrame has a different length than self.:
See also
DataFrame.aggOnly perform aggregating type operations.
DataFrame.applyInvoke function on a DataFrame.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)}) >>> df A B 0 0 1 1 1 2 2 2 3 >>> df.transform(lambda x: x + 1) A B 0 1 2 1 2 3 2 3 4
Even though the resulting DataFrame must have the same length as the input DataFrame, it is possible to provide several input functions:
>>> s = pd.Series(range(3)) >>> s 0 0 1 1 2 2 dtype: int64 >>> s.transform([np.sqrt, np.exp]) sqrt exp 0 0.000000 1.000000 1 1.000000 2.718282 2 1.414214 7.389056
You can call transform on a GroupBy object:
>>> df = pd.DataFrame({ ... "Date": [ ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05", ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"], ... "Data": [5, 8, 6, 1, 50, 100, 60, 120], ... }) >>> df Date Data 0 2015-05-08 5 1 2015-05-07 8 2 2015-05-06 6 3 2015-05-05 1 4 2015-05-08 50 5 2015-05-07 100 6 2015-05-06 60 7 2015-05-05 120 >>> df.groupby('Date')['Data'].transform('sum') 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data, dtype: int64
>>> df = pd.DataFrame({ ... "c": [1, 1, 1, 2, 2, 2, 2], ... "type": ["m", "n", "o", "m", "m", "n", "n"] ... }) >>> df c type 0 1 m 1 1 n 2 1 o 3 2 m 4 2 m 5 2 n 6 2 n >>> df['size'] = df.groupby('c')['type'].transform(len) >>> df c type size 0 1 m 3 1 1 n 3 2 1 o 3 3 2 m 4 4 2 m 4 5 2 n 4 6 2 n 4
- transpose(*args, copy: bool = False) DataFrame
Transpose index and columns.
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property
Tis an accessor to the methodtranspose().- Parameters:
*args (tuple, optional) – Accepted for compatibility with NumPy.
copy (bool, default False) –
Whether to copy the data after transposing, even for DataFrames with a single dtype.
Note that a copy is always required for mixed dtype DataFrames, or for DataFrames with any extension types.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
The transposed DataFrame.
- Return type:
See also
numpy.transposePermute the dimensions of a given array.
Notes
Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.
Examples
Square DataFrame with homogeneous dtype
>>> d1 = {'col1': [1, 2], 'col2': [3, 4]} >>> df1 = pd.DataFrame(data=d1) >>> df1 col1 col2 0 1 3 1 2 4
>>> df1_transposed = df1.T # or df1.transpose() >>> df1_transposed 0 1 col1 1 2 col2 3 4
When the dtype is homogeneous in the original DataFrame, we get a transposed DataFrame with the same dtype:
>>> df1.dtypes col1 int64 col2 int64 dtype: object >>> df1_transposed.dtypes 0 int64 1 int64 dtype: object
Non-square DataFrame with mixed dtypes
>>> d2 = {'name': ['Alice', 'Bob'], ... 'score': [9.5, 8], ... 'employed': [False, True], ... 'kids': [0, 0]} >>> df2 = pd.DataFrame(data=d2) >>> df2 name score employed kids 0 Alice 9.5 False 0 1 Bob 8.0 True 0
>>> df2_transposed = df2.T # or df2.transpose() >>> df2_transposed 0 1 name Alice Bob score 9.5 8.0 employed False True kids 0 0
When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype:
>>> df2.dtypes name object score float64 employed bool kids int64 dtype: object >>> df2_transposed.dtypes 0 object 1 object dtype: object
- truediv(other, axis: Axis = 'columns', level=None, fill_value=None) DataFrame
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to
dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.Among flexible wrappers (add, sub, mul, div, floordiv, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters:
other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns:
Result of the arithmetic operation.
- Return type:
See also
DataFrame.addAdd DataFrames.
DataFrame.subSubtract DataFrames.
DataFrame.mulMultiply DataFrames.
DataFrame.divDivide DataFrames (float division).
DataFrame.truedivDivide DataFrames (float division).
DataFrame.floordivDivide DataFrames (integer division).
DataFrame.modCalculate modulo (remainder after division).
DataFrame.powCalculate exponential power.
Notes
Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with operator version which return the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a dictionary by axis.
>>> df.mul({'angles': 0, 'degrees': 2}) angles degrees circle 0 720 triangle 0 360 rectangle 0 720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index') angles degrees circle 0 0 triangle 6 360 rectangle 12 1080
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- unstack(level: IndexLabel = -1, fill_value=None, sort: bool = True)
Pivot a level of the (necessarily hierarchical) index labels.
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.
If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex).
- Parameters:
level (int, str, or list of these, default -1 (last level)) – Level(s) of index to unstack, can pass level name.
fill_value (int, str or dict) – Replace NaN with this value if the unstack produces missing values.
sort (bool, default True) – Sort the level(s) in the resulting MultiIndex columns.
- Return type:
See also
DataFrame.pivotPivot a table based on column values.
DataFrame.stackPivot a level of the column labels (inverse operation from unstack).
Notes
Reference the user guide for more examples.
Examples
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ... ('two', 'a'), ('two', 'b')]) >>> s = pd.Series(np.arange(1.0, 5.0), index=index) >>> s one a 1.0 b 2.0 two a 3.0 b 4.0 dtype: float64
>>> s.unstack(level=-1) a b one 1.0 2.0 two 3.0 4.0
>>> s.unstack(level=0) one two a 1.0 3.0 b 2.0 4.0
>>> df = s.unstack(level=0) >>> df.unstack() one a 1.0 b 2.0 two a 3.0 b 4.0 dtype: float64
- update(other, join: UpdateJoin = 'left', overwrite: bool = True, filter_func=None, errors: IgnoreRaise = 'ignore') None
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
- Parameters:
other (DataFrame, or object coercible into a DataFrame) – Should have at least one matching index/column label with the original DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name to align with the original DataFrame.
join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the original object.
overwrite (bool, default True) –
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values with values from other.
False: only update values that are NA in the original DataFrame.
filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values that should be updated.
errors ({'raise', 'ignore'}, default 'ignore') – If ‘raise’, will raise a ValueError if the DataFrame and other both contain non-NA data in the same place.
- Returns:
This method directly changes calling object.
- Return type:
None
- Raises:
ValueError –
When errors=’raise’ and there’s overlapping non-NA data. * When errors is not either ‘ignore’ or ‘raise’
NotImplementedError –
If join != ‘left’
See also
dict.updateSimilar method for dictionaries.
DataFrame.mergeFor column(s)-on-column(s) operations.
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3], ... 'B': [400, 500, 600]}) >>> new_df = pd.DataFrame({'B': [4, 5, 6], ... 'C': [7, 8, 9]}) >>> df.update(new_df) >>> df A B 0 1 4 1 2 5 2 3 6
The DataFrame’s length does not increase as a result of the update, only values at matching index/column labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], ... 'B': ['x', 'y', 'z']}) >>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']}) >>> df.update(new_df) >>> df A B 0 a d 1 b e 2 c f
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], ... 'B': ['x', 'y', 'z']}) >>> new_df = pd.DataFrame({'B': ['d', 'f']}, index=[0, 2]) >>> df.update(new_df) >>> df A B 0 a d 1 b y 2 c f
For Series, its name attribute must be set.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'], ... 'B': ['x', 'y', 'z']}) >>> new_column = pd.Series(['d', 'e', 'f'], name='B') >>> df.update(new_column) >>> df A B 0 a d 1 b e 2 c f
If other contains NaNs the corresponding values are not updated in the original dataframe.
>>> df = pd.DataFrame({'A': [1, 2, 3], ... 'B': [400., 500., 600.]}) >>> new_df = pd.DataFrame({'B': [4, np.nan, 6]}) >>> df.update(new_df) >>> df A B 0 1 4.0 1 2 500.0 2 3 6.0
- value_counts(subset: IndexLabel | None = None, normalize: bool = False, sort: bool = True, ascending: bool = False, dropna: bool = True) Series
Return a Series containing the frequency of each distinct row in the Dataframe.
- Parameters:
subset (label or list of labels, optional) – Columns to use when counting unique combinations.
normalize (bool, default False) – Return proportions rather than frequencies.
sort (bool, default True) – Sort by frequencies when True. Sort by DataFrame column values when False.
ascending (bool, default False) – Sort in ascending order.
dropna (bool, default True) –
Don’t include counts of rows that contain NA values.
Added in version 1.3.0.
- Return type:
See also
Series.value_countsEquivalent method on Series.
Notes
The returned Series will have a MultiIndex with one level per input column but an Index (non-multi) for a single label. By default, rows that contain any NA values are omitted from the result. By default, the resulting Series will be in descending order so that the first element is the most frequently-occurring row.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6], ... 'num_wings': [2, 0, 0, 0]}, ... index=['falcon', 'dog', 'cat', 'ant']) >>> df num_legs num_wings falcon 2 2 dog 4 0 cat 4 0 ant 6 0
>>> df.value_counts() num_legs num_wings 4 0 2 2 2 1 6 0 1 Name: count, dtype: int64
>>> df.value_counts(sort=False) num_legs num_wings 2 2 1 4 0 2 6 0 1 Name: count, dtype: int64
>>> df.value_counts(ascending=True) num_legs num_wings 2 2 1 6 0 1 4 0 2 Name: count, dtype: int64
>>> df.value_counts(normalize=True) num_legs num_wings 4 0 0.50 2 2 0.25 6 0 0.25 Name: proportion, dtype: float64
With dropna set to False we can also count rows with NA values.
>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'], ... 'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']}) >>> df first_name middle_name 0 John Smith 1 Anne <NA> 2 John <NA> 3 Beth Louise
>>> df.value_counts() first_name middle_name Beth Louise 1 John Smith 1 Name: count, dtype: int64
>>> df.value_counts(dropna=False) first_name middle_name Anne NaN 1 Beth Louise 1 John Smith 1 NaN 1 Name: count, dtype: int64
>>> df.value_counts("first_name") first_name John 2 Anne 1 Beth 1 Name: count, dtype: int64
- property values: ndarray
Return a Numpy representation of the DataFrame.
Warning
We recommend using
DataFrame.to_numpy()instead.Only the values in the DataFrame will be returned, the axes labels will be removed.
- Returns:
The values of the DataFrame.
- Return type:
See also
DataFrame.to_numpyRecommended alternative to this method.
DataFrame.indexRetrieve the index labels.
DataFrame.columnsRetrieving the column names.
Notes
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By
numpy.find_common_type()convention, mixing int64 and uint64 will result in a float64 dtype.Examples
A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
>>> df = pd.DataFrame({'age': [ 3, 29], ... 'height': [94, 170], ... 'weight': [31, 115]}) >>> df age height weight 0 3 94 31 1 29 170 115 >>> df.dtypes age int64 height int64 weight int64 dtype: object >>> df.values array([[ 3, 94, 31], [ 29, 170, 115]])
A DataFrame with mixed type columns(e.g., str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types (e.g., object).
>>> df2 = pd.DataFrame([('parrot', 24.0, 'second'), ... ('lion', 80.5, 1), ... ('monkey', np.nan, None)], ... columns=('name', 'max_speed', 'rank')) >>> df2.dtypes name object max_speed float64 rank object dtype: object >>> df2.values array([['parrot', 24.0, 'second'], ['lion', 80.5, 1], ['monkey', nan, None]], dtype=object)
- var(axis: Axis | None = 0, skipna: bool = True, ddof: int = 1, numeric_only: bool = False, **kwargs)
Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0), columns (1)}) –
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.var with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
>>> df.var() age 352.916667 height 0.056367 dtype: float64
Alternatively,
ddof=0can be set to normalize by N instead of N-1:>>> df.var(ddof=0) age 264.687500 height 0.042275 dtype: float64
- class pyranges1.ext.stats.Series(data=None, index=None, dtype: Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | lib.NoDefault = <no_default>)
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their associated index values– they need not be the same length. The result index will be the sorted union of the two indexes.
- Parameters:
data (array-like, Iterable, dict, or scalar value) – Contains data stored in Series. If data is a dict, argument order is maintained.
index (array-like or Index (1d)) – Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
dtype (str, numpy.dtype, or ExtensionDtype, optional) – Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.
name (Hashable, default None) – The name to give to the Series.
copy (bool, default False) – Copy input data. Only affects Series or 1d ndarray input. See examples.
Notes
Please reference the User Guide for more information.
Examples
Constructing Series from a dictionary with an Index specified
>>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['a', 'b', 'c']) >>> ser a 1 b 2 c 3 dtype: int64
The keys of the dictionary match with the Index values, hence the Index values have no effect.
>>> d = {'a': 1, 'b': 2, 'c': 3} >>> ser = pd.Series(data=d, index=['x', 'y', 'z']) >>> ser x NaN y NaN z NaN dtype: float64
Note that the Index is first build with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result.
Constructing Series from a list with copy=False.
>>> r = [1, 2] >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r [1, 2] >>> ser 0 999 1 2 dtype: int64
Due to input data type the Series has a copy of the original data even though copy=False, so the data is unchanged.
Constructing Series from a 1d ndarray with copy=False.
>>> r = np.array([1, 2]) >>> ser = pd.Series(r, copy=False) >>> ser.iloc[0] = 999 >>> r array([999, 2]) >>> ser 0 999 1 2 dtype: int64
Due to input data type the Series has a view on the original data, so the data is changed as well.
- add(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Addition of series and other, element-wise (binary operator add).
Equivalent to
series + other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.raddReverse of the Addition operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
- agg(func=None, axis: Axis = 0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g.
[np.sum, 'mean']dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
- Return type:
See also
Series.applyInvoke function on a Series.
Series.transformTransform function producing a Series with like indexes.
Notes
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g.,
numpy.mean(arr_2d)as opposed tonumpy.mean(arr_2d, axis=0).agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s 0 1 1 2 2 3 3 4 dtype: int64
>>> s.agg('min') 1
>>> s.agg(['min', 'max']) min 1 max 4 dtype: int64
- aggregate(func=None, axis: Axis = 0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters:
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g.
[np.sum, 'mean']dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
- Return type:
See also
Series.applyInvoke function on a Series.
Series.transformTransform function producing a Series with like indexes.
Notes
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g.,
numpy.mean(arr_2d)as opposed tonumpy.mean(arr_2d, axis=0).agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s 0 1 1 2 2 3 3 4 dtype: int64
>>> s.agg('min') 1
>>> s.agg(['min', 'max']) min 1 max 4 dtype: int64
- all(axis: Axis = 0, bool_only: bool = False, skipna: bool = True, **kwargs) bool
Return whether all elements are True, potentially over an axis.
Returns True unless there at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default False) – Include only boolean columns. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, then, Series is returned; otherwise, scalar is returned.
- Return type:
scalar or Series
See also
Series.allReturn True if all elements are True.
DataFrame.anyReturn True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all() True >>> pd.Series([True, False]).all() False >>> pd.Series([], dtype="float64").all() True >>> pd.Series([np.nan]).all() True >>> pd.Series([np.nan]).all(skipna=False) True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) >>> df col1 col2 0 True True 1 True False
Default behaviour checks if values in each column all return True.
>>> df.all() col1 True col2 False dtype: bool
Specify
axis='columns'to check if values in each row all return True.>>> df.all(axis='columns') 0 True 1 False dtype: bool
Or
axis=Nonefor whether every value is True.>>> df.all(axis=None) False
- any(*, axis: Axis = 0, bool_only: bool = False, skipna: bool = True, **kwargs) bool
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
- Parameters:
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default False) – Include only boolean columns. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
If level is specified, then, Series is returned; otherwise, scalar is returned.
- Return type:
scalar or Series
See also
numpy.anyNumpy version of this method.
Series.anyReturn whether any element is True.
Series.allReturn whether all elements are True.
DataFrame.anyReturn whether any element is True over requested axis.
DataFrame.allReturn whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any() False >>> pd.Series([True, False]).any() True >>> pd.Series([], dtype="float64").any() False >>> pd.Series([np.nan]).any() False >>> pd.Series([np.nan]).any(skipna=False) True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) >>> df A B C 0 1 0 0 1 2 2 0
>>> df.any() A True B True C False dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) >>> df A B 0 True 1 1 False 2
>>> df.any(axis='columns') 0 True 1 True dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) >>> df A B 0 True 1 1 False 0
>>> df.any(axis='columns') 0 True 1 False dtype: bool
Aggregating over the entire DataFrame with
axis=None.>>> df.any(axis=None) True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any() Series([], dtype: bool)
- apply(func: AggFuncType, convert_dtype: bool | lib.NoDefault = <no_default>, args: tuple[Any, ...] = (), *, by_row: Literal[False, 'compat'] = 'compat', **kwargs) DataFrame | Series
Invoke function on values of Series.
Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
- Parameters:
func (function) – Python function or NumPy ufunc to apply.
convert_dtype (bool, default True) –
Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.
Deprecated since version 2.1.0:
convert_dtypehas been deprecated. Doser.astype(object).apply()instead if you wantconvert_dtype=False.args (tuple) – Positional arguments passed to func after the series value.
by_row (False or "compat", default "compat") –
If
"compat"and func is a callable, func will be passed each element of the Series, likeSeries.map. If func is a list or dict of callables, will first try to translate each func into pandas methods. If that doesn’t work, will try call to apply again withby_row="compat"and if that fails, will call apply again withby_row=False(backward compatible). If False, the func will be passed the whole Series at once.by_rowhas no effect whenfuncis a string.Added in version 2.1.0.
**kwargs – Additional keyword arguments passed to func.
- Returns:
If func returns a Series object the result will be a DataFrame.
- Return type:
See also
Series.mapFor element-wise operations.
Series.aggOnly perform aggregating type operations.
Series.transformOnly perform transforming type operations.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
Create a series with typical summer temperatures for each city.
>>> s = pd.Series([20, 21, 12], ... index=['London', 'New York', 'Helsinki']) >>> s London 20 New York 21 Helsinki 12 dtype: int64
Square the values by defining a function and passing it as an argument to
apply().>>> def square(x): ... return x ** 2 >>> s.apply(square) London 400 New York 441 Helsinki 144 dtype: int64
Square the values by passing an anonymous function as an argument to
apply().>>> s.apply(lambda x: x ** 2) London 400 New York 441 Helsinki 144 dtype: int64
Define a custom function that needs additional positional arguments and pass these additional arguments using the
argskeyword.>>> def subtract_custom_value(x, custom_value): ... return x - custom_value
>>> s.apply(subtract_custom_value, args=(5,)) London 15 New York 16 Helsinki 7 dtype: int64
Define a custom function that takes keyword arguments and pass these arguments to
apply.>>> def add_custom_values(x, **kwargs): ... for month in kwargs: ... x += kwargs[month] ... return x
>>> s.apply(add_custom_values, june=30, july=20, august=25) London 95 New York 96 Helsinki 87 dtype: int64
Use a function from the Numpy library.
>>> s.apply(np.log) London 2.995732 New York 3.044522 Helsinki 2.484907 dtype: float64
- argsort(axis: Axis = 0, kind: SortKind = 'quicksort', order: None = None, stable: None = None) Series
Return the integer indices that would sort the Series values.
Override ndarray.argsort. Argsorts the value, omitting NA/null values, and places the result in the same locations as the non-NA values.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
kind ({'mergesort', 'quicksort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See
numpy.sort()for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms.order (None) – Has no effect but is accepted for compatibility with numpy.
stable (None) – Has no effect but is accepted for compatibility with numpy.
- Returns:
Positions of values within the sort order with -1 indicating nan values.
- Return type:
Series[np.intp]
See also
numpy.ndarray.argsortReturns the indices that would sort this array.
Examples
>>> s = pd.Series([3, 2, 1]) >>> s.argsort() 0 2 1 1 2 0 dtype: int64
- property array: ExtensionArray
The ExtensionArray of the data backing this Series or Index.
- Returns:
An ExtensionArray of the values stored within. For extension types, this is the actual array. For NumPy native types, this is a thin (no copy) wrapper around
numpy.ndarray..arraydiffers from.values, which may require converting the data to a different form.- Return type:
ExtensionArray
See also
Index.to_numpySimilar method that always returns a NumPy array.
Series.to_numpySimilar method that always returns a NumPy array.
Notes
This table lays out the different array types for each extension dtype within pandas.
dtype
array type
category
Categorical
period
PeriodArray
interval
IntervalArray
IntegerNA
IntegerArray
string
StringArray
boolean
BooleanArray
datetime64[ns, tz]
DatetimeArray
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes
.arraywill be aarrays.NumpyExtensionArraywrapping the actual ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then useSeries.to_numpy()instead.Examples
For regular NumPy types like int, and float, a NumpyExtensionArray is returned.
>>> pd.Series([1, 2, 3]).array <NumpyExtensionArray> [1, 2, 3] Length: 3, dtype: int64
For extension types, like Categorical, the actual ExtensionArray is returned
>>> ser = pd.Series(pd.Categorical(['a', 'b', 'a'])) >>> ser.array ['a', 'b', 'a'] Categories (2, object): ['a', 'b']
- autocorr(lag: int = 1) float
Compute the lag-N autocorrelation.
This method computes the Pearson correlation between the Series and its shifted self.
- Parameters:
lag (int, default 1) – Number of lags to apply before performing autocorrelation.
- Returns:
The Pearson correlation between self and self.shift(lag).
- Return type:
float
See also
Series.corrCompute the correlation between two Series.
Series.shiftShift index by desired number of periods.
DataFrame.corrCompute pairwise correlation of columns.
DataFrame.corrwithCompute pairwise correlation between rows or columns of two DataFrame objects.
Notes
If the Pearson correlation is not well defined return ‘NaN’.
Examples
>>> s = pd.Series([0.25, 0.5, 0.2, -0.05]) >>> s.autocorr() 0.10355... >>> s.autocorr(lag=2) -0.99999...
If the Pearson correlation is not well defined, then ‘NaN’ is returned.
>>> s = pd.Series([1, 0, 0, 0]) >>> s.autocorr() nan
- property axes: list[Index]
Return a list of the row axis labels.
- between(left, right, inclusive: Literal['both', 'neither', 'left', 'right'] = 'both') Series
Return boolean Series equivalent to left <= series <= right.
This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
- Parameters:
left (scalar or list-like) – Left boundary.
right (scalar or list-like) – Right boundary.
inclusive ({"both", "neither", "left", "right"}) –
Include boundaries. Whether to set each bound as closed or open.
Changed in version 1.3.0.
- Returns:
Series representing whether each element is between left and right (inclusive).
- Return type:
Notes
This function is equivalent to
(left <= ser) & (ser <= right)Examples
>>> s = pd.Series([2, 0, 4, 8, np.nan])
Boundary values are included by default:
>>> s.between(1, 4) 0 True 1 False 2 True 3 False 4 False dtype: bool
With inclusive set to
"neither"boundary values are excluded:>>> s.between(1, 4, inclusive="neither") 0 True 1 False 2 False 3 False 4 False dtype: bool
left and right can be any scalar value:
>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve']) >>> s.between('Anna', 'Daniel') 0 False 1 True 2 True 3 False dtype: bool
- case_when(caselist: list[tuple[ArrayLike | Callable[[Series], Series | np.ndarray | Sequence[bool]], ArrayLike | Scalar | Callable[[Series], Series | np.ndarray]]]) Series
Replace values where the conditions are True.
- Parameters:
caselist (A list of tuples of conditions and expected replacements) –
Takes the form:
(condition0, replacement0),(condition1, replacement1), … .conditionshould be a 1-D boolean array-like object or a callable. Ifconditionis a callable, it is computed on the Series and should return a boolean Series or array. The callable must not change the input Series (though pandas doesn`t check it).replacementshould be a 1-D array-like object, a scalar or a callable. Ifreplacementis a callable, it is computed on the Series and should return a scalar or Series. The callable must not change the input Series (though pandas doesn`t check it).Added in version 2.2.0.
- Return type:
See also
Series.maskReplace values where the condition is True.
Examples
>>> c = pd.Series([6, 7, 8, 9], name='c') >>> a = pd.Series([0, 0, 1, 2]) >>> b = pd.Series([0, 3, 4, 5])
>>> c.case_when(caselist=[(a.gt(0), a), # condition, replacement ... (b.gt(0), b)]) 0 6 1 3 2 1 3 2 Name: c, dtype: int64
- cat
alias of
CategoricalAccessor
- combine(other: Series | Hashable, func: Callable[[Hashable, Hashable], Hashable], fill_value: Hashable | None = None) Series
Combine the Series with a Series or scalar according to func.
Combine the Series and other using func to perform elementwise selection for combined Series. fill_value is assumed when value is missing at some index from one of the two objects being combined.
- Parameters:
other (Series or scalar) – The value(s) to be combined with the Series.
func (function) – Function that takes two scalars as inputs and returns an element.
fill_value (scalar, optional) – The value to assume when an index is missing from one Series or the other. The default specifies to use the appropriate NaN value for the underlying dtype of the Series.
- Returns:
The result of combining the Series with the other object.
- Return type:
See also
Series.combine_firstCombine Series values, choosing the calling Series’ values first.
Examples
Consider 2 Datasets
s1ands2containing highest clocked speeds of different birds.>>> s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0}) >>> s1 falcon 330.0 eagle 160.0 dtype: float64 >>> s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0}) >>> s2 falcon 345.0 eagle 200.0 duck 30.0 dtype: float64
Now, to combine the two datasets and view the highest speeds of the birds across the two datasets
>>> s1.combine(s2, max) duck NaN eagle 200.0 falcon 345.0 dtype: float64
In the previous example, the resulting value for duck is missing, because the maximum of a NaN and a float is a NaN. So, in the example, we set
fill_value=0, so the maximum value returned will be the value from some dataset.>>> s1.combine(s2, max, fill_value=0) duck 30.0 eagle 200.0 falcon 345.0 dtype: float64
- combine_first(other) Series
Update null elements with value in the same location in ‘other’.
Combine two Series objects by filling null values in one Series with non-null values from the other Series. Result index will be the union of the two indexes.
- Parameters:
other (Series) – The value(s) to be used for filling null values.
- Returns:
The result of combining the provided Series with the other object.
- Return type:
See also
Series.combinePerform element-wise operation on two Series using a given function.
Examples
>>> s1 = pd.Series([1, np.nan]) >>> s2 = pd.Series([3, 4, 5]) >>> s1.combine_first(s2) 0 1.0 1 4.0 2 5.0 dtype: float64
Null values still persist if the location of that null value does not exist in other
>>> s1 = pd.Series({'falcon': np.nan, 'eagle': 160.0}) >>> s2 = pd.Series({'eagle': 200.0, 'duck': 30.0}) >>> s1.combine_first(s2) duck 30.0 eagle 160.0 falcon NaN dtype: float64
- compare(other: Series, align_axis: Axis = 1, keep_shape: bool = False, keep_equal: bool = False, result_names: Suffixes = ('self', 'other')) DataFrame | Series
Compare to another Series and show the differences.
- Parameters:
other (Series) – Object to compare with.
align_axis ({0 or 'index', 1 or 'columns'}, default 1) –
Determine which axis to align the comparison on.
- 0, or ‘index’Resulting differences are stacked vertically
with rows drawn alternately from self and other.
- 1, or ‘columns’Resulting differences are aligned horizontally
with columns drawn alternately from self and other.
keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
result_names (tuple, default ('self', 'other')) –
Set the dataframes names in the comparison.
Added in version 1.5.0.
- Returns:
If axis is 0 or ‘index’ the result will be a Series. The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
If axis is 1 or ‘columns’ the result will be a DataFrame. It will have two columns namely ‘self’ and ‘other’.
- Return type:
See also
DataFrame.compareCompare with another DataFrame and show differences.
Notes
Matching NaNs will not appear as a difference.
Examples
>>> s1 = pd.Series(["a", "b", "c", "d", "e"]) >>> s2 = pd.Series(["a", "a", "c", "b", "e"])
Align the differences on columns
>>> s1.compare(s2) self other 1 b a 3 d b
Stack the differences on indices
>>> s1.compare(s2, align_axis=0) 1 self b other a 3 self d other b dtype: object
Keep all original rows
>>> s1.compare(s2, keep_shape=True) self other 0 NaN NaN 1 b a 2 NaN NaN 3 d b 4 NaN NaN
Keep all original rows and also all original values
>>> s1.compare(s2, keep_shape=True, keep_equal=True) self other 0 a a 1 b a 2 c c 3 d b 4 e e
- corr(other: Series, method: CorrelationMethod = 'pearson', min_periods: int | None = None) float
Compute correlation with other Series, excluding missing values.
The two Series objects are not required to be the same length and will be aligned internally before the correlation function is applied.
- Parameters:
other (Series) – Series with which to compute the correlation.
method ({'pearson', 'kendall', 'spearman'} or callable) –
Method used to compute correlation:
pearson : Standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: Callable with input two 1d ndarrays and returning a float.
Warning
Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.
min_periods (int, optional) – Minimum number of observations needed to have a valid result.
- Returns:
Correlation with other.
- Return type:
float
See also
DataFrame.corrCompute pairwise correlation between columns.
DataFrame.corrwithCompute pairwise correlation with another DataFrame or Series.
Notes
Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
Automatic data alignment: as with all pandas operations, automatic data alignment is performed for this method.
corr()automatically considers values with matching indices.Examples
>>> def histogram_intersection(a, b): ... v = np.minimum(a, b).sum().round(decimals=1) ... return v >>> s1 = pd.Series([.2, .0, .6, .2]) >>> s2 = pd.Series([.3, .6, .0, .1]) >>> s1.corr(s2, method=histogram_intersection) 0.3
Pandas auto-aligns the values with matching indices
>>> s1 = pd.Series([1, 2, 3], index=[0, 1, 2]) >>> s2 = pd.Series([1, 2, 3], index=[2, 1, 0]) >>> s1.corr(s2) -1.0
- count() int
Return number of non-NA/null observations in the Series.
- Returns:
Number of non-null values in the Series.
- Return type:
int
See also
DataFrame.countCount non-NA cells for each column or row.
Examples
>>> s = pd.Series([0.0, 1.0, np.nan]) >>> s.count() 2
- cov(other: Series, min_periods: int | None = None, ddof: int | None = 1) float
Compute covariance with Series, excluding missing values.
The two Series objects are not required to be the same length and will be aligned internally before the covariance is calculated.
- Parameters:
other (Series) – Series with which to compute the covariance.
min_periods (int, optional) – Minimum number of observations needed to have a valid result.
ddof (int, default 1) – Delta degrees of freedom. The divisor used in calculations is
N - ddof, whereNrepresents the number of elements.
- Returns:
Covariance between Series and other normalized by N-1 (unbiased estimator).
- Return type:
float
See also
DataFrame.covCompute pairwise covariance of columns.
Examples
>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035]) >>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198]) >>> s1.cov(s2) -0.01685762652715874
- cummax(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative maximum of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.maxSimilar functionality but ignores
NaNvalues.Series.maxReturn the maximum over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cummax(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cummax() A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use
axis=1>>> df.cummax(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
- cummin(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative minimum of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.minSimilar functionality but ignores
NaNvalues.Series.minReturn the minimum over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cummin(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cummin() A B 0 2.0 1.0 1 2.0 NaN 2 1.0 0.0
To iterate over columns and find the minimum in each row, use
axis=1>>> df.cummin(axis=1) A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
- cumprod(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative product of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.prodSimilar functionality but ignores
NaNvalues.Series.prodReturn the product over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cumprod(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cumprod() A B 0 2.0 1.0 1 6.0 NaN 2 6.0 0.0
To iterate over columns and find the product in each row, use
axis=1>>> df.cumprod(axis=1) A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
- cumsum(axis: Axis | None = None, skipna: bool = True, *args, **kwargs)
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Return cumulative sum of scalar or Series.
- Return type:
scalar or Series
See also
core.window.expanding.Expanding.sumSimilar functionality but ignores
NaNvalues.Series.sumReturn the sum over Series axis.
Series.cummaxReturn cumulative maximum over Series axis.
Series.cumminReturn cumulative minimum over Series axis.
Series.cumsumReturn cumulative sum over Series axis.
Series.cumprodReturn cumulative product over Series axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) >>> s 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use
skipna=False>>> s.cumsum(skipna=False) 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to
axis=Noneoraxis='index'.>>> df.cumsum() A B 0 2.0 1.0 1 5.0 NaN 2 6.0 1.0
To iterate over columns and find the sum in each row, use
axis=1>>> df.cumsum(axis=1) A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
- diff(periods: int = 1) Series
First discrete difference of element.
Calculates the difference of a Series element compared with another element in the Series (default is element in previous row).
- Parameters:
periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.
- Returns:
First differences of the Series.
- Return type:
See also
Series.pct_changePercent change over given number of periods.
Series.shiftShift index by desired number of periods with an optional time freq.
DataFrame.diffFirst discrete difference of object.
Notes
For boolean dtypes, this uses
operator.xor()rather thanoperator.sub(). The result is calculated according to current dtype in Series, however dtype of the result is always float64.Examples
Difference with previous row
>>> s = pd.Series([1, 1, 2, 3, 5, 8]) >>> s.diff() 0 NaN 1 0.0 2 1.0 3 1.0 4 2.0 5 3.0 dtype: float64
Difference with 3rd previous row
>>> s.diff(periods=3) 0 NaN 1 NaN 2 NaN 3 2.0 4 4.0 5 6.0 dtype: float64
Difference with following row
>>> s.diff(periods=-1) 0 0.0 1 -1.0 2 -1.0 3 -2.0 4 -3.0 5 NaN dtype: float64
Overflow in input dtype
>>> s = pd.Series([1, 0], dtype=np.uint8) >>> s.diff() 0 NaN 1 255.0 dtype: float64
- div(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to
series / other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rtruedivReverse of the Floating division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- divide(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to
series / other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rtruedivReverse of the Floating division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- divmod(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Integer division and modulo of series and other, element-wise (binary operator divmod).
Equivalent to
divmod(series, other), but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
2-Tuple of Series
See also
Series.rdivmodReverse of the Integer division and modulo operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divmod(b, fill_value=0) (a 1.0 b inf c inf d 0.0 e NaN dtype: float64, a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64)
- dot(other: AnyArrayLike) Series | np.ndarray
Compute the dot product between the Series and the columns of other.
This method computes the dot product between the Series and another one, or the Series and each columns of a DataFrame, or the Series and each columns of an array.
It can also be called using self @ other.
- Parameters:
other (Series, DataFrame or array-like) – The other object to compute the dot product with its columns.
- Returns:
Return the dot product of the Series and other if other is a Series, the Series of the dot product of Series and each rows of other if other is a DataFrame or a numpy.ndarray between the Series and each columns of the numpy array.
- Return type:
scalar, Series or numpy.ndarray
See also
DataFrame.dotCompute the matrix product with the DataFrame.
Series.mulMultiplication of series and other, element-wise.
Notes
The Series and other has to share the same index if other is a Series or a DataFrame.
Examples
>>> s = pd.Series([0, 1, 2, 3]) >>> other = pd.Series([-1, 2, -3, 4]) >>> s.dot(other) 8 >>> s @ other 8 >>> df = pd.DataFrame([[0, 1], [-2, 3], [4, -5], [6, 7]]) >>> s.dot(df) 0 24 1 14 dtype: int64 >>> arr = np.array([[0, 1], [-2, 3], [4, -5], [6, 7]]) >>> s.dot(arr) array([24, 14])
- drop(labels: IndexLabel = None, *, axis: Axis = 0, index: IndexLabel = None, columns: IndexLabel = None, level: Level | None = None, inplace: Literal[True], errors: IgnoreRaise = 'raise') None
- drop(labels: IndexLabel = None, *, axis: Axis = 0, index: IndexLabel = None, columns: IndexLabel = None, level: Level | None = None, inplace: Literal[False] = False, errors: IgnoreRaise = 'raise') Series
- drop(labels: IndexLabel = None, *, axis: Axis = 0, index: IndexLabel = None, columns: IndexLabel = None, level: Level | None = None, inplace: bool = False, errors: IgnoreRaise = 'raise') Series | None
Return Series with specified index labels removed.
Remove elements of a Series based on specifying the index labels. When using a multi-index, labels on different levels can be removed by specifying the level.
- Parameters:
labels (single label or list-like) – Index labels to drop.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
index (single label or list-like) – Redundant for application on Series, but ‘index’ can be used instead of ‘labels’.
columns (single label or list-like) – No change is made to the Series; use ‘index’ or ‘labels’ instead.
level (int or level name, optional) – For MultiIndex, level for which the labels will be removed.
inplace (bool, default False) – If True, do operation inplace and return None.
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
- Returns:
Series with specified index labels removed or None if
inplace=True.- Return type:
Series or None
- Raises:
KeyError – If none of the labels are found in the index.
See also
Series.reindexReturn only specified index labels of Series.
Series.dropnaReturn series without null values.
Series.drop_duplicatesReturn Series with duplicate values removed.
DataFrame.dropDrop specified labels from rows or columns.
Examples
>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C']) >>> s A 0 B 1 C 2 dtype: int64
Drop labels B en C
>>> s.drop(labels=['B', 'C']) A 0 dtype: int64
Drop 2nd level label in MultiIndex Series
>>> midx = pd.MultiIndex(levels=[['llama', 'cow', 'falcon'], ... ['speed', 'weight', 'length']], ... codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> s = pd.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], ... index=midx) >>> s llama speed 45.0 weight 200.0 length 1.2 cow speed 30.0 weight 250.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3 dtype: float64
>>> s.drop(labels='weight', level=1) llama speed 45.0 length 1.2 cow speed 30.0 length 1.5 falcon speed 320.0 length 0.3 dtype: float64
- drop_duplicates(*, keep: DropKeep = 'first', inplace: Literal[False] = False, ignore_index: bool = False) Series
- drop_duplicates(*, keep: DropKeep = 'first', inplace: Literal[True], ignore_index: bool = False) None
- drop_duplicates(*, keep: DropKeep = 'first', inplace: bool = False, ignore_index: bool = False) Series | None
Return Series with duplicate values removed.
- Parameters:
keep ({‘first’, ‘last’,
False}, default ‘first’) –Method to handle dropping duplicates:
’first’ : Drop duplicates except for the first occurrence.
’last’ : Drop duplicates except for the last occurrence.
False: Drop all duplicates.
inplace (bool, default
False) – IfTrue, performs operation inplace and returns None.ignore_index (bool, default
False) –If
True, the resulting axis will be labeled 0, 1, …, n - 1.Added in version 2.0.0.
- Returns:
Series with duplicates dropped or None if
inplace=True.- Return type:
Series or None
See also
Index.drop_duplicatesEquivalent method on Index.
DataFrame.drop_duplicatesEquivalent method on DataFrame.
Series.duplicatedRelated method on Series, indicating duplicate Series values.
Series.uniqueReturn unique values as an array.
Examples
Generate a Series with duplicated entries.
>>> s = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama', 'hippo'], ... name='animal') >>> s 0 llama 1 cow 2 llama 3 beetle 4 llama 5 hippo Name: animal, dtype: object
With the ‘keep’ parameter, the selection behaviour of duplicated values can be changed. The value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’.
>>> s.drop_duplicates() 0 llama 1 cow 3 beetle 5 hippo Name: animal, dtype: object
The value ‘last’ for parameter ‘keep’ keeps the last occurrence for each set of duplicated entries.
>>> s.drop_duplicates(keep='last') 1 cow 3 beetle 4 llama 5 hippo Name: animal, dtype: object
The value
Falsefor parameter ‘keep’ discards all sets of duplicated entries.>>> s.drop_duplicates(keep=False) 1 cow 3 beetle 5 hippo Name: animal, dtype: object
- dropna(*, axis: Axis = 0, inplace: Literal[False] = False, how: AnyAll | None = None, ignore_index: bool = False) Series
- dropna(*, axis: Axis = 0, inplace: Literal[True], how: AnyAll | None = None, ignore_index: bool = False) None
Return a new Series with missing values removed.
See the User Guide for more on which values are considered missing, and how to work with missing data.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
inplace (bool, default False) – If True, do operation inplace and return None.
how (str, optional) – Not in use. Kept for compatibility.
ignore_index (bool, default
False) –If
True, the resulting axis will be labeled 0, 1, …, n - 1.Added in version 2.0.0.
- Returns:
Series with NA entries dropped from it or None if
inplace=True.- Return type:
Series or None
See also
Series.isnaIndicate missing values.
Series.notnaIndicate existing (non-missing) values.
Series.fillnaReplace missing values.
DataFrame.dropnaDrop rows or columns which contain NA values.
Index.dropnaDrop missing indices.
Examples
>>> ser = pd.Series([1., 2., np.nan]) >>> ser 0 1.0 1 2.0 2 NaN dtype: float64
Drop NA values from a Series.
>>> ser.dropna() 0 1.0 1 2.0 dtype: float64
Empty strings are not considered NA values.
Noneis considered an NA value.>>> ser = pd.Series([np.nan, 2, pd.NaT, '', None, 'I stay']) >>> ser 0 NaN 1 2 2 NaT 3 4 None 5 I stay dtype: object >>> ser.dropna() 1 2 3 5 I stay dtype: object
- dt
alias of
CombinedDatetimelikeProperties
- property dtype: DtypeObj
Return the dtype object of the underlying data.
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.dtype dtype('int64')
- property dtypes: DtypeObj
Return the dtype object of the underlying data.
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.dtypes dtype('int64')
- duplicated(keep: DropKeep = 'first') Series
Indicate duplicate Series values.
Duplicated values are indicated as
Truevalues in the resulting Series. Either all duplicates, all except the first or all except the last occurrence of duplicates can be indicated.- Parameters:
keep ({'first', 'last', False}, default 'first') –
Method to handle dropping duplicates:
’first’ : Mark duplicates as
Trueexcept for the first occurrence.’last’ : Mark duplicates as
Trueexcept for the last occurrence.False: Mark all duplicates asTrue.
- Returns:
Series indicating whether each value has occurred in the preceding values.
- Return type:
Series[bool]
See also
Index.duplicatedEquivalent method on pandas.Index.
DataFrame.duplicatedEquivalent method on pandas.DataFrame.
Series.drop_duplicatesRemove duplicate values from Series.
Examples
By default, for each set of duplicated values, the first occurrence is set on False and all others on True:
>>> animals = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama']) >>> animals.duplicated() 0 False 1 False 2 True 3 False 4 True dtype: bool
which is equivalent to
>>> animals.duplicated(keep='first') 0 False 1 False 2 True 3 False 4 True dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others on True:
>>> animals.duplicated(keep='last') 0 True 1 False 2 True 3 False 4 False dtype: bool
By setting keep on
False, all duplicates are True:>>> animals.duplicated(keep=False) 0 True 1 False 2 True 3 False 4 True dtype: bool
- eq(other, level: Level | None = None, fill_value: float | None = None, axis: Axis = 0) Series
Return Equal to of series and other, element-wise (binary operator eq).
Equivalent to
series == other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.eq(b, fill_value=0) a True b False c False d False e False dtype: bool
- explode(ignore_index: bool = False) Series
Transform each element of a list-like to a row.
- Parameters:
ignore_index (bool, default False) – If True, the resulting index will be labeled 0, 1, …, n - 1.
- Returns:
Exploded lists to rows; index will be duplicated for these rows.
- Return type:
See also
Series.str.splitSplit string values on specified separator.
Series.unstackUnstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.
DataFrame.meltUnpivot a DataFrame from wide format to long format.
DataFrame.explodeExplode a DataFrame from list-like columns to long format.
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.
Reference the user guide for more examples.
Examples
>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]]) >>> s 0 [1, 2, 3] 1 foo 2 [] 3 [3, 4] dtype: object
>>> s.explode() 0 1 0 2 0 3 1 foo 2 NaN 3 3 3 4 dtype: object
- floordiv(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Integer division of series and other, element-wise (binary operator floordiv).
Equivalent to
series // other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rfloordivReverse of the Integer division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.floordiv(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- ge(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Greater than or equal to of series and other, element-wise (binary operator ge).
Equivalent to
series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.ge(b, fill_value=0) a True b True c False d False e True f False dtype: bool
- groupby(by=None, axis: Axis = 0, level: IndexLabel | None = None, as_index: bool = True, sort: bool = True, group_keys: bool = True, observed: bool | lib.NoDefault = <no_default>, dropna: bool = True) SeriesGroupBy
Group Series using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
- Parameters:
by (mapping, function, label, pd.Grouper or list of such) –
Used to determine the groups for the groupby. If
byis a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see.align()method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns inself. Notice that a tuple is interpreted as a (single) key.axis ({0 or 'index', 1 or 'columns'}, default 0) –
Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.
Deprecated since version 2.1.0: Will be removed and behave like axis=0 in a future version. For
axis=1, doframe.T.groupby(...)instead.level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both
byandlevel.as_index (bool, default True) –
Return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output. This argument has no effect on filtrations (see the filtrations in the user guide), such as
head(),tail(),nth()and in transformations (see the transformations in the user guide).sort (bool, default True) –
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group. If False, the groups will appear in the same order as they did in the original DataFrame. This argument has no effect on filtrations (see the filtrations in the user guide), such as
head(),tail(),nth()and in transformations (see the transformations in the user guide).Changed in version 2.0.0: Specifying
sort=Falsewith an ordered categorical grouper will no longer sort the values.group_keys (bool, default True) –
When calling apply and the
byargument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.Changed in version 1.5.0: Warns that
group_keyswill no longer be ignored when the result fromapplyis a like-indexed Series or DataFrame. Specifygroup_keysexplicitly to include the group keys or not.Changed in version 2.0.0:
group_keysnow defaults toTrue.observed (bool, default False) –
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
Deprecated since version 2.1.0: The default value will change to True in a future version of pandas.
dropna (bool, default True) – If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
- Returns:
Returns a groupby object that contains information about the groups.
- Return type:
pandas.api.typing.SeriesGroupBy
See also
resampleConvenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.
Examples
>>> ser = pd.Series([390., 350., 30., 20.], ... index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... name="Max Speed") >>> ser Falcon 390.0 Falcon 350.0 Parrot 30.0 Parrot 20.0 Name: Max Speed, dtype: float64 >>> ser.groupby(["a", "b", "a", "b"]).mean() a 210.0 b 185.0 Name: Max Speed, dtype: float64 >>> ser.groupby(level=0).mean() Falcon 370.0 Parrot 25.0 Name: Max Speed, dtype: float64 >>> ser.groupby(ser > 100).mean() Max Speed False 25.0 True 370.0 Name: Max Speed, dtype: float64
Grouping by Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'], ... ['Captive', 'Wild', 'Captive', 'Wild']] >>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type')) >>> ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed") >>> ser Animal Type Falcon Captive 390.0 Wild 350.0 Parrot Captive 30.0 Wild 20.0 Name: Max Speed, dtype: float64 >>> ser.groupby(level=0).mean() Animal Falcon 370.0 Parrot 25.0 Name: Max Speed, dtype: float64 >>> ser.groupby(level="Type").mean() Type Captive 210.0 Wild 185.0 Name: Max Speed, dtype: float64
We can also choose to include NA in group keys or not by defining dropna parameter, the default setting is True.
>>> ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan]) >>> ser.groupby(level=0).sum() a 3 b 3 dtype: int64
>>> ser.groupby(level=0, dropna=False).sum() a 3 b 3 NaN 3 dtype: int64
>>> arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot'] >>> ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed") >>> ser.groupby(["a", "b", "a", np.nan]).mean() a 210.0 b 350.0 Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", np.nan], dropna=False).mean() a 210.0 b 350.0 NaN 20.0 Name: Max Speed, dtype: float64
- gt(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Greater than of series and other, element-wise (binary operator gt).
Equivalent to
series > other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.gt(b, fill_value=0) a True b False c False d False e True f False dtype: bool
- property hasnans: bool
Return True if there are any NaNs.
Enables various performance speedups.
- Return type:
bool
Examples
>>> s = pd.Series([1, 2, 3, None]) >>> s 0 1.0 1 2.0 2 3.0 3 NaN dtype: float64 >>> s.hasnans True
- hist(by=None, ax=None, grid: bool = True, xlabelsize: int | None = None, xrot: float | None = None, ylabelsize: int | None = None, yrot: float | None = None, figsize: tuple[int, int] | None = None, bins: int | Sequence[int] = 10, backend: str | None = None, legend: bool = False, **kwargs)
Draw histogram of the input series using matplotlib.
- Parameters:
by (object, optional) – If passed, then used to form histograms for separate groups.
ax (matplotlib axis object) – If not passed, uses gca().
grid (bool, default True) – Whether to show axis grid lines.
xlabelsize (int, default None) – If specified changes the x-axis label size.
xrot (float, default None) – Rotation of x axis labels.
ylabelsize (int, default None) – If specified changes the y-axis label size.
yrot (float, default None) – Rotation of y axis labels.
figsize (tuple, default None) – Figure size in inches by default.
bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.
backend (str, default None) – Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify theplotting.backendfor the whole session, setpd.options.plotting.backend.legend (bool, default False) – Whether to show the legend.
**kwargs – To be passed to the actual plotting function.
- Returns:
A histogram plot.
- Return type:
matplotlib.AxesSubplot
See also
matplotlib.axes.Axes.histPlot a histogram using matplotlib.
Examples
For Series:
For Groupby:
- idxmax(axis: Axis = 0, skipna: bool = True, *args, **kwargs) Hashable
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that value is returned.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
skipna (bool, default True) – Exclude NA/null values. If the entire Series is NA, the result will be NA.
*args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Label of the maximum value.
- Return type:
Index
- Raises:
ValueError – If the Series is empty.
See also
numpy.argmaxReturn indices of the maximum values along the given axis.
DataFrame.idxmaxReturn index of first occurrence of maximum over requested axis.
Series.idxminReturn index label of the first occurrence of minimum of values.
Notes
This method is the Series version of
ndarray.argmax. This method returns the label of the maximum, whilendarray.argmaxreturns the position. To get the position, useseries.values.argmax().Examples
>>> s = pd.Series(data=[1, None, 4, 3, 4], ... index=['A', 'B', 'C', 'D', 'E']) >>> s A 1.0 B NaN C 4.0 D 3.0 E 4.0 dtype: float64
>>> s.idxmax() 'C'
If skipna is False and there is an NA value in the data, the function returns
nan.>>> s.idxmax(skipna=False) nan
- idxmin(axis: Axis = 0, skipna: bool = True, *args, **kwargs) Hashable
Return the row label of the minimum value.
If multiple values equal the minimum, the first row label with that value is returned.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
skipna (bool, default True) – Exclude NA/null values. If the entire Series is NA, the result will be NA.
*args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Label of the minimum value.
- Return type:
Index
- Raises:
ValueError – If the Series is empty.
See also
numpy.argminReturn indices of the minimum values along the given axis.
DataFrame.idxminReturn index of first occurrence of minimum over requested axis.
Series.idxmaxReturn index label of the first occurrence of maximum of values.
Notes
This method is the Series version of
ndarray.argmin. This method returns the label of the minimum, whilendarray.argminreturns the position. To get the position, useseries.values.argmin().Examples
>>> s = pd.Series(data=[1, None, 4, 1], ... index=['A', 'B', 'C', 'D']) >>> s A 1.0 B NaN C 4.0 D 1.0 dtype: float64
>>> s.idxmin() 'A'
If skipna is False and there is an NA value in the data, the function returns
nan.>>> s.idxmin(skipna=False) nan
- index
The index (axis labels) of the Series.
The index of a Series is used to label and identify each element of the underlying data. The index can be thought of as an immutable ordered set (technically a multi-set, as it may contain duplicate labels), and is used to index and align data in pandas.
- Returns:
The index labels of the Series.
- Return type:
Index
See also
Series.reindexConform Series to new index.
IndexThe base pandas index type.
Notes
For more information on pandas indexing, see the indexing user guide.
Examples
To create a Series with a custom index and view the index labels:
>>> cities = ['Kolkata', 'Chicago', 'Toronto', 'Lisbon'] >>> populations = [14.85, 2.71, 2.93, 0.51] >>> city_series = pd.Series(populations, index=cities) >>> city_series.index Index(['Kolkata', 'Chicago', 'Toronto', 'Lisbon'], dtype='object')
To change the index labels of an existing Series:
>>> city_series.index = ['KOL', 'CHI', 'TOR', 'LIS'] >>> city_series.index Index(['KOL', 'CHI', 'TOR', 'LIS'], dtype='object')
- info(verbose: bool | None = None, buf: IO[str] | None = None, max_cols: int | None = None, memory_usage: bool | str | None = None, show_counts: bool = True) None
Print a concise summary of a Series.
This method prints information about a Series including the index dtype, non-null values and memory usage.
Added in version 1.4.0.
- Parameters:
verbose (bool, optional) – Whether to print the full summary. By default, the setting in
pandas.options.display.max_info_columnsis followed.buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
memory_usage (bool, str, optional) –
Specifies whether total memory usage of the Series elements (including the index) should be displayed. By default, this follows the
pandas.options.display.memory_usagesetting.True always show memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources. See the Frequently Asked Questions for more details.
show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than
pandas.options.display.max_info_rowsandpandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.
- Returns:
This method prints a summary of a Series and returns None.
- Return type:
None
See also
Series.describeGenerate descriptive statistics of Series.
Series.memory_usageMemory usage of Series.
Examples
>>> int_values = [1, 2, 3, 4, 5] >>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon'] >>> s = pd.Series(text_values, index=int_values) >>> s.info() <class 'pandas.core.series.Series'> Index: 5 entries, 1 to 5 Series name: None Non-Null Count Dtype -------------- ----- 5 non-null object dtypes: object(1) memory usage: 80.0+ bytes
Prints a summary excluding information about its values:
>>> s.info(verbose=False) <class 'pandas.core.series.Series'> Index: 5 entries, 1 to 5 dtypes: object(1) memory usage: 80.0+ bytes
Pipe output of Series.info to buffer instead of sys.stdout, get buffer content and writes to a text file:
>>> import io >>> buffer = io.StringIO() >>> s.info(buf=buffer) >>> s = buffer.getvalue() >>> with open("df_info.txt", "w", ... encoding="utf-8") as f: ... f.write(s) 260
The memory_usage parameter allows deep introspection mode, specially useful for big Series and fine-tune memory optimization:
>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6) >>> s = pd.Series(np.random.choice(['a', 'b', 'c'], 10 ** 6)) >>> s.info() <class 'pandas.core.series.Series'> RangeIndex: 1000000 entries, 0 to 999999 Series name: None Non-Null Count Dtype -------------- ----- 1000000 non-null object dtypes: object(1) memory usage: 7.6+ MB
>>> s.info(memory_usage='deep') <class 'pandas.core.series.Series'> RangeIndex: 1000000 entries, 0 to 999999 Series name: None Non-Null Count Dtype -------------- ----- 1000000 non-null object dtypes: object(1) memory usage: 55.3 MB
- isin(values) Series
Whether elements in Series are contained in values.
Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
- Parameters:
values (set or list-like) – The sequence of values to test. Passing in a single string will raise a
TypeError. Instead, turn a single string into a list of one element.- Returns:
Series of booleans indicating if each element is in values.
- Return type:
- Raises:
TypeError –
If values is a string
See also
DataFrame.isinEquivalent method on DataFrame.
Examples
>>> s = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama', ... 'hippo'], name='animal') >>> s.isin(['cow', 'llama']) 0 True 1 True 2 True 3 False 4 True 5 False Name: animal, dtype: bool
To invert the boolean values, use the
~operator:>>> ~s.isin(['cow', 'llama']) 0 False 1 False 2 False 3 True 4 False 5 True Name: animal, dtype: bool
Passing a single string as
s.isin('llama')will raise an error. Use a list of one element instead:>>> s.isin(['llama']) 0 True 1 False 2 True 3 False 4 True 5 False Name: animal, dtype: bool
Strings and integers are distinct and are therefore not comparable:
>>> pd.Series([1]).isin(['1']) 0 False dtype: bool >>> pd.Series([1.1]).isin(['1.1']) 0 False dtype: bool
- isna() Series
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True).- Returns:
Mask of bool values for each element in Series that indicates whether an element is an NA value.
- Return type:
See also
Series.isnullAlias of isna.
Series.notnaBoolean inverse of isna.
Series.dropnaOmit axes labels with missing values.
isnaTop-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- isnull() Series
Series.isnull is an alias for Series.isna.
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True).- Returns:
Mask of bool values for each element in Series that indicates whether an element is an NA value.
- Return type:
See also
Series.isnullAlias of isna.
Series.notnaBoolean inverse of isna.
Series.dropnaOmit axes labels with missing values.
isnaTop-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() 0 False 1 False 2 True dtype: bool
- items() Iterable[tuple[Hashable, Any]]
Lazily iterate over (index, value) tuples.
This method returns an iterable tuple (index, value). This is convenient if you want to create a lazy iterator.
- Returns:
Iterable of tuples containing the (index, value) pairs from a Series.
- Return type:
iterable
See also
DataFrame.itemsIterate over (column name, Series) pairs.
DataFrame.iterrowsIterate over DataFrame rows as (index, Series) pairs.
Examples
>>> s = pd.Series(['A', 'B', 'C']) >>> for index, value in s.items(): ... print(f"Index : {index}, Value : {value}") Index : 0, Value : A Index : 1, Value : B Index : 2, Value : C
- keys() Index
Return alias for index.
- Returns:
Index of the Series.
- Return type:
Index
Examples
>>> s = pd.Series([1, 2, 3], index=[0, 1, 2]) >>> s.keys() Index([0, 1, 2], dtype='int64')
- kurt(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse']) >>> s cat 1 dog 2 dog 2 mouse 3 dtype: int64 >>> s.kurt() 1.5
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]}, ... index=['cat', 'dog', 'dog', 'mouse']) >>> df a b cat 1 3 dog 2 4 dog 2 4 mouse 3 4 >>> df.kurt() a 1.5 b 4.0 dtype: float64
With axis=None
>>> df.kurt(axis=None).round(6) -0.988693
Using axis=1
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]}, ... index=['cat', 'dog']) >>> df.kurt(axis=1) cat -6.0 dog -6.0 dtype: float64
- Return type:
scalar or scalar
- kurtosis(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 2, 3], index=['cat', 'dog', 'dog', 'mouse']) >>> s cat 1 dog 2 dog 2 mouse 3 dtype: int64 >>> s.kurt() 1.5
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]}, ... index=['cat', 'dog', 'dog', 'mouse']) >>> df a b cat 1 3 dog 2 4 dog 2 4 mouse 3 4 >>> df.kurt() a 1.5 b 4.0 dtype: float64
With axis=None
>>> df.kurt(axis=None).round(6) -0.988693
Using axis=1
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [3, 4], 'd': [1, 2]}, ... index=['cat', 'dog']) >>> df.kurt(axis=1) cat -6.0 dog -6.0 dtype: float64
- Return type:
scalar or scalar
- le(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Less than or equal to of series and other, element-wise (binary operator le).
Equivalent to
series <= other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.le(b, fill_value=0) a False b True c True d False e False f True dtype: bool
- list
alias of
ListAccessor
- lt(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Less than of series and other, element-wise (binary operator lt).
Equivalent to
series < other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Examples
>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e']) >>> a a 1.0 b 1.0 c 1.0 d NaN e 1.0 dtype: float64 >>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f']) >>> b a 0.0 b 1.0 c 2.0 d NaN f 1.0 dtype: float64 >>> a.lt(b, fill_value=0) a False b False c True d False e False f True dtype: bool
- map(arg: Callable | Mapping | Series, na_action: Literal['ignore'] | None = None) Series
Map values of Series according to an input mapping or function.
Used for substituting each value in a Series with another value, that may be derived from a function, a
dictor aSeries.- Parameters:
arg (function, collections.abc.Mapping subclass or Series) – Mapping correspondence.
na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.
- Returns:
Same index as caller.
- Return type:
See also
Series.applyFor applying more complex functions on a Series.
Series.replaceReplace values given in to_replace with value.
DataFrame.applyApply a function row-/column-wise.
DataFrame.mapApply a function elementwise on a whole DataFrame.
Notes
When
argis a dictionary, values in Series that are not in the dictionary (as keys) are converted toNaN. However, if the dictionary is adictsubclass that defines__missing__(i.e. provides a method for default values), then this default is used rather thanNaN.Examples
>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit']) >>> s 0 cat 1 dog 2 NaN 3 rabbit dtype: object
mapaccepts adictor aSeries. Values that are not found in thedictare converted toNaN, unless the dict has a default value (e.g.defaultdict):>>> s.map({'cat': 'kitten', 'dog': 'puppy'}) 0 kitten 1 puppy 2 NaN 3 NaN dtype: object
It also accepts a function:
>>> s.map('I am a {}'.format) 0 I am a cat 1 I am a dog 2 I am a nan 3 I am a rabbit dtype: object
To avoid applying the function to missing values (and keep them as
NaN)na_action='ignore'can be used:>>> s.map('I am a {}'.format, na_action='ignore') 0 I am a cat 1 I am a dog 2 NaN 3 I am a rabbit dtype: object
- max(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the maximum of the values over the requested axis.
If you want the index of the maximum, use
idxmax. This is the equivalent of thenumpy.ndarraymethodargmax.- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.max() 8
- mean(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the mean of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.mean() 2.0
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra']) >>> df a b tiger 1 2 zebra 2 3 >>> df.mean() a 1.5 b 2.5 dtype: float64
Using axis=1
>>> df.mean(axis=1) tiger 1.5 zebra 2.5 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, ... index=['tiger', 'zebra']) >>> df.mean(numeric_only=True) a 1.5 dtype: float64
- Return type:
scalar or scalar
- median(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the median of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.median() 2.0
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra']) >>> df a b tiger 1 2 zebra 2 3 >>> df.median() a 1.5 b 2.5 dtype: float64
Using axis=1
>>> df.median(axis=1) tiger 1.5 zebra 2.5 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, ... index=['tiger', 'zebra']) >>> df.median(numeric_only=True) a 1.5 dtype: float64
- Return type:
scalar or scalar
- memory_usage(index: bool = True, deep: bool = False) int
Return the memory usage of the Series.
The memory usage can optionally include the contribution of the index and of elements of object dtype.
- Parameters:
index (bool, default True) – Specifies whether to include the memory usage of the Series index.
deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned value.
- Returns:
Bytes of memory consumed.
- Return type:
int
See also
numpy.ndarray.nbytesTotal bytes consumed by the elements of the array.
DataFrame.memory_usageBytes consumed by a DataFrame.
Examples
>>> s = pd.Series(range(3)) >>> s.memory_usage() 152
Not including the index gives the size of the rest of the data, which is necessarily smaller:
>>> s.memory_usage(index=False) 24
The memory footprint of object values is ignored by default:
>>> s = pd.Series(["a", "b"]) >>> s.values array(['a', 'b'], dtype=object) >>> s.memory_usage() 144 >>> s.memory_usage(deep=True) 244
- min(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return the minimum of the values over the requested axis.
If you want the index of the minimum, use
idxmin. This is the equivalent of thenumpy.ndarraymethodargmin.- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.min() 0
- mod(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Modulo of series and other, element-wise (binary operator mod).
Equivalent to
series % other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rmodReverse of the Modulo operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.mod(b, fill_value=0) a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64
- mode(dropna: bool = True) Series
Return the mode(s) of the Series.
The mode is the value that appears most often. There can be multiple modes.
Always returns Series even if only one value is returned.
- Parameters:
dropna (bool, default True) – Don’t consider counts of NaN/NaT.
- Returns:
Modes of the Series in sorted order.
- Return type:
Examples
>>> s = pd.Series([2, 4, 2, 2, 4, None]) >>> s.mode() 0 2.0 dtype: float64
More than one mode:
>>> s = pd.Series([2, 4, 8, 2, 4, None]) >>> s.mode() 0 2.0 1 4.0 dtype: float64
With and without considering null value:
>>> s = pd.Series([2, 4, None, None, 4, None]) >>> s.mode(dropna=False) 0 NaN dtype: float64 >>> s = pd.Series([2, 4, None, None, 4, None]) >>> s.mode() 0 4.0 dtype: float64
- mul(other, level: Level | None = None, fill_value: float | None = None, axis: Axis = 0) Series
Return Multiplication of series and other, element-wise (binary operator mul).
Equivalent to
series * other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rmulReverse of the Multiplication operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
- multiply(other, level: Level | None = None, fill_value: float | None = None, axis: Axis = 0) Series
Return Multiplication of series and other, element-wise (binary operator mul).
Equivalent to
series * other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rmulReverse of the Multiplication operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
- property name: Hashable
Return the name of the Series.
The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used whenever displaying the Series using the interpreter.
- Returns:
The name of the Series, also the column name if part of a DataFrame.
- Return type:
label (hashable object)
See also
Series.renameSets the Series name when given a scalar input.
Index.nameCorresponding Index property.
Examples
The Series name can be set initially when calling the constructor.
>>> s = pd.Series([1, 2, 3], dtype=np.int64, name='Numbers') >>> s 0 1 1 2 2 3 Name: Numbers, dtype: int64 >>> s.name = "Integers" >>> s 0 1 1 2 2 3 Name: Integers, dtype: int64
The name of a Series within a DataFrame is its column name.
>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], ... columns=["Odd Numbers", "Even Numbers"]) >>> df Odd Numbers Even Numbers 0 1 2 1 3 4 2 5 6 >>> df["Even Numbers"].name 'Even Numbers'
- ne(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Not equal to of series and other, element-wise (binary operator ne).
Equivalent to
series != other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.ne(b, fill_value=0) a False b True c True d True e True dtype: bool
- nlargest(n: int = 5, keep: Literal['first', 'last', 'all'] = 'first') Series
Return the largest n elements.
- Parameters:
n (int, default 5) – Return this many descending sorted values.
keep ({'first', 'last', 'all'}, default 'first') –
When there are duplicate values that cannot all fit in a Series of n elements:
first: return the first n occurrences in order of appearance.last: return the last n occurrences in reverse order of appearance.all: keep all occurrences. This can result in a Series of size larger than n.
- Returns:
The n largest values in the Series, sorted in decreasing order.
- Return type:
See also
Series.nsmallestGet the n smallest elements.
Series.sort_valuesSort Series by values.
Series.headReturn the first n rows.
Notes
Faster than
.sort_values(ascending=False).head(n)for small n relative to the size of theSeriesobject.Examples
>>> countries_population = {"Italy": 59000000, "France": 65000000, ... "Malta": 434000, "Maldives": 434000, ... "Brunei": 434000, "Iceland": 337000, ... "Nauru": 11300, "Tuvalu": 11300, ... "Anguilla": 11300, "Montserrat": 5200} >>> s = pd.Series(countries_population) >>> s Italy 59000000 France 65000000 Malta 434000 Maldives 434000 Brunei 434000 Iceland 337000 Nauru 11300 Tuvalu 11300 Anguilla 11300 Montserrat 5200 dtype: int64
The n largest elements where
n=5by default.>>> s.nlargest() France 65000000 Italy 59000000 Malta 434000 Maldives 434000 Brunei 434000 dtype: int64
The n largest elements where
n=3. Default keep value is ‘first’ so Malta will be kept.>>> s.nlargest(3) France 65000000 Italy 59000000 Malta 434000 dtype: int64
The n largest elements where
n=3and keeping the last duplicates. Brunei will be kept since it is the last with value 434000 based on the index order.>>> s.nlargest(3, keep='last') France 65000000 Italy 59000000 Brunei 434000 dtype: int64
The n largest elements where
n=3with all duplicates kept. Note that the returned Series has five elements due to the three duplicates.>>> s.nlargest(3, keep='all') France 65000000 Italy 59000000 Malta 434000 Maldives 434000 Brunei 434000 dtype: int64
- notna() Series
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True). NA values, such as None ornumpy.NaN, get mapped to False values.- Returns:
Mask of bool values for each element in Series that indicates whether an element is not an NA value.
- Return type:
See also
Series.notnullAlias of notna.
Series.isnaBoolean inverse of notna.
Series.dropnaOmit axes labels with missing values.
notnaTop-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- notnull() Series
Series.notnull is an alias for Series.notna.
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
''ornumpy.infare not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True). NA values, such as None ornumpy.NaN, get mapped to False values.- Returns:
Mask of bool values for each element in Series that indicates whether an element is not an NA value.
- Return type:
See also
Series.notnullAlias of notna.
Series.isnaBoolean inverse of notna.
Series.dropnaOmit axes labels with missing values.
notnaTop-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan], ... born=[pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... name=['Alfred', 'Batman', ''], ... toy=[None, 'Batmobile', 'Joker'])) >>> df age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.nan]) >>> ser 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() 0 True 1 True 2 False dtype: bool
- nsmallest(n: int = 5, keep: Literal['first', 'last', 'all'] = 'first') Series
Return the smallest n elements.
- Parameters:
n (int, default 5) – Return this many ascending sorted values.
keep ({'first', 'last', 'all'}, default 'first') –
When there are duplicate values that cannot all fit in a Series of n elements:
first: return the first n occurrences in order of appearance.last: return the last n occurrences in reverse order of appearance.all: keep all occurrences. This can result in a Series of size larger than n.
- Returns:
The n smallest values in the Series, sorted in increasing order.
- Return type:
See also
Series.nlargestGet the n largest elements.
Series.sort_valuesSort Series by values.
Series.headReturn the first n rows.
Notes
Faster than
.sort_values().head(n)for small n relative to the size of theSeriesobject.Examples
>>> countries_population = {"Italy": 59000000, "France": 65000000, ... "Brunei": 434000, "Malta": 434000, ... "Maldives": 434000, "Iceland": 337000, ... "Nauru": 11300, "Tuvalu": 11300, ... "Anguilla": 11300, "Montserrat": 5200} >>> s = pd.Series(countries_population) >>> s Italy 59000000 France 65000000 Brunei 434000 Malta 434000 Maldives 434000 Iceland 337000 Nauru 11300 Tuvalu 11300 Anguilla 11300 Montserrat 5200 dtype: int64
The n smallest elements where
n=5by default.>>> s.nsmallest() Montserrat 5200 Nauru 11300 Tuvalu 11300 Anguilla 11300 Iceland 337000 dtype: int64
The n smallest elements where
n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.>>> s.nsmallest(3) Montserrat 5200 Nauru 11300 Tuvalu 11300 dtype: int64
The n smallest elements where
n=3and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.>>> s.nsmallest(3, keep='last') Montserrat 5200 Anguilla 11300 Tuvalu 11300 dtype: int64
The n smallest elements where
n=3with all duplicates kept. Note that the returned Series has four elements due to the three duplicates.>>> s.nsmallest(3, keep='all') Montserrat 5200 Nauru 11300 Tuvalu 11300 Anguilla 11300 dtype: int64
- plot
alias of
PlotAccessor
- pop(item: Hashable) Any
Return item and drops from series. Raise KeyError if not found.
- Parameters:
item (label) – Index of the element that needs to be removed.
- Return type:
Value that is popped from series.
Examples
>>> ser = pd.Series([1, 2, 3])
>>> ser.pop(0) 1
>>> ser 1 2 2 3 dtype: int64
- pow(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Exponential power of series and other, element-wise (binary operator pow).
Equivalent to
series ** other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rpowReverse of the Exponential power operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.pow(b, fill_value=0) a 1.0 b 1.0 c 1.0 d 0.0 e NaN dtype: float64
- prod(axis: Axis | None = None, skipna: bool = True, numeric_only: bool = False, min_count: int = 0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_countnon-NA values are present the result will be NA.**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is
1>>> pd.Series([], dtype="float64").prod() 1.0
This can be controlled with the
min_countparameter>>> pd.Series([], dtype="float64").prod(min_count=1) nan
Thanks to the
skipnaparameter,min_counthandles all-NA and empty series identically.>>> pd.Series([np.nan]).prod() 1.0
>>> pd.Series([np.nan]).prod(min_count=1) nan
- product(axis: Axis | None = None, skipna: bool = True, numeric_only: bool = False, min_count: int = 0, **kwargs)
Return the product of the values over the requested axis.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_countnon-NA values are present the result will be NA.**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
By default, the product of an empty or all-NA Series is
1>>> pd.Series([], dtype="float64").prod() 1.0
This can be controlled with the
min_countparameter>>> pd.Series([], dtype="float64").prod(min_count=1) nan
Thanks to the
skipnaparameter,min_counthandles all-NA and empty series identically.>>> pd.Series([np.nan]).prod() 1.0
>>> pd.Series([np.nan]).prod(min_count=1) nan
- quantile(q: float = 0.5, interpolation: QuantileInterpolation = 'linear') float
- quantile(q: Sequence[float] | AnyArrayLike, interpolation: QuantileInterpolation = 'linear') Series
- quantile(q: float | Sequence[float] | AnyArrayLike = 0.5, interpolation: QuantileInterpolation = 'linear') float | Series
Return value at the given quantile.
- Parameters:
q (float or array-like, default 0.5 (50% quantile)) – The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
linear: i + (j - i) * (x-i)/(j-i), where (x-i)/(j-i) is the fractional part of the index surrounded by i > j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
- Returns:
If
qis an array, a Series will be returned where the index isqand the values are the quantiles, otherwise a float will be returned.- Return type:
float or Series
See also
core.window.Rolling.quantileCalculate the rolling quantile.
numpy.percentileReturns the q-th percentile(s) of the array elements.
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s.quantile(.5) 2.5 >>> s.quantile([.25, .5, .75]) 0.25 1.75 0.50 2.50 0.75 3.25 dtype: float64
- radd(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Addition of series and other, element-wise (binary operator radd).
Equivalent to
other + series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.addElement-wise Addition, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
- ravel(order: str = 'C') ArrayLike
Return the flattened underlying data as an ndarray or ExtensionArray.
Deprecated since version 2.2.0: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary. Use
to_numpy()for conversion to a numpy array instead.- Returns:
Flattened data of the Series.
- Return type:
numpy.ndarray or ExtensionArray
See also
numpy.ndarray.ravelReturn a flattened array.
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.ravel() array([1, 2, 3])
- rdiv(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to
other / series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.truedivElement-wise Floating division, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- rdivmod(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Integer division and modulo of series and other, element-wise (binary operator rdivmod).
Equivalent to
other divmod series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
2-Tuple of Series
See also
Series.divmodElement-wise Integer division and modulo, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divmod(b, fill_value=0) (a 1.0 b inf c inf d 0.0 e NaN dtype: float64, a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64)
- reindex(index=None, *, axis: Axis | None = None, method: ReindexMethod | None = None, copy: bool | None = None, level: Level | None = None, fill_value: Scalar | None = None, limit: int | None = None, tolerance=None) Series
Conform Series to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and
copy=False.- Parameters:
index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.
axis (int or str, optional) – Unused.
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy (bool, default True) –
Return a new object, even if the passed indexes are the same.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = Truelevel (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (scalar, default np.nan) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.
tolerance (optional) –
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations most satisfy the equation
abs(index[indexer] - target) <= tolerance.Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
- Return type:
Series with changed index.
See also
DataFrame.set_indexSet row labels.
DataFrame.reset_indexRemove row labels or move them to new columns.
DataFrame.reindex_likeChange to same indices as other DataFrame.
Examples
DataFrame.reindexsupports two calling conventions(index=index_labels, columns=column_labels, ...)(labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror'] >>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301], ... 'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, ... index=index) >>> df http_status response_time Firefox 200 0.04 Chrome 200 0.02 Safari 404 0.07 IE10 404 0.08 Konqueror 301 1.00
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned
NaN.>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', ... 'Chrome'] >>> df.reindex(new_index) http_status response_time Safari 404.0 0.07 Iceweasel NaN NaN Comodo Dragon NaN NaN IE10 404.0 0.08 Chrome 200.0 0.02
We can fill in the missing values by passing a value to the keyword
fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keywordmethodto fill theNaNvalues.>>> df.reindex(new_index, fill_value=0) http_status response_time Safari 404 0.07 Iceweasel 0 0.00 Comodo Dragon 0 0.00 IE10 404 0.08 Chrome 200 0.02
>>> df.reindex(new_index, fill_value='missing') http_status response_time Safari 404 0.07 Iceweasel missing missing Comodo Dragon missing missing IE10 404 0.08 Chrome 200 0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent']) http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
Or we can use “axis-style” keyword arguments
>>> df.reindex(['http_status', 'user_agent'], axis="columns") http_status user_agent Firefox 200 NaN Chrome 200 NaN Safari 404 NaN IE10 404 NaN Konqueror 301 NaN
To further illustrate the filling functionality in
reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D') >>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, ... index=date_index) >>> df2 prices 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0
Suppose we decide to expand the dataframe to cover a wider date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D') >>> df2.reindex(date_index2) prices 2009-12-29 NaN 2009-12-30 NaN 2009-12-31 NaN 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with
NaN. If desired, we can fill in the missing values using one of several options.For example, to back-propagate the last valid value to fill the
NaNvalues, passbfillas an argument to themethodkeyword.>>> df2.reindex(date_index2, method='bfill') prices 2009-12-29 100.0 2009-12-30 100.0 2009-12-31 100.0 2010-01-01 100.0 2010-01-02 101.0 2010-01-03 NaN 2010-01-04 100.0 2010-01-05 89.0 2010-01-06 88.0 2010-01-07 NaN
Please note that the
NaNvalue present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in theNaNvalues present in the original dataframe, use thefillna()method.See the user guide for more.
- rename(index: Renamer | Hashable | None = None, *, axis: Axis | None = None, copy: bool = None, inplace: Literal[True], level: Level | None = None, errors: IgnoreRaise = 'ignore') None
- rename(index: Renamer | Hashable | None = None, *, axis: Axis | None = None, copy: bool = None, inplace: Literal[False] = False, level: Level | None = None, errors: IgnoreRaise = 'ignore') Series
- rename(index: Renamer | Hashable | None = None, *, axis: Axis | None = None, copy: bool = None, inplace: bool = False, level: Level | None = None, errors: IgnoreRaise = 'ignore') Series | None
Alter Series index labels or name.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
Alternatively, change
Series.namewith a scalar value.See the user guide for more.
- Parameters:
index (scalar, hashable sequence, dict-like or function optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the
Series.nameattribute.axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
copy (bool, default True) –
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = Trueinplace (bool, default False) – Whether to return a new Series. If True the value of copy is ignored.
level (int or level name, default None) – In case of MultiIndex, only rename labels in the specified level.
errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise KeyError when a dict-like mapper or index contains labels that are not present in the index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.
- Returns:
Series with index labels or name altered or None if
inplace=True.- Return type:
Series or None
See also
DataFrame.renameCorresponding DataFrame method.
Series.rename_axisSet the name of the axis.
Examples
>>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3 dtype: int64 >>> s.rename("my_name") # scalar, changes Series.name 0 1 1 2 2 3 Name: my_name, dtype: int64 >>> s.rename(lambda x: x ** 2) # function, changes labels 0 1 1 2 4 3 dtype: int64 >>> s.rename({1: 3, 2: 5}) # mapping, changes labels 0 1 3 2 5 3 dtype: int64
- rename_axis(mapper: IndexLabel | lib.NoDefault = <no_default>, *, index=<no_default>, axis: Axis = 0, copy: bool = True, inplace: Literal[True]) None
- rename_axis(mapper: IndexLabel | lib.NoDefault = <no_default>, *, index=<no_default>, axis: Axis = 0, copy: bool = True, inplace: Literal[False] = False) Self
- rename_axis(mapper: IndexLabel | lib.NoDefault = <no_default>, *, index=<no_default>, axis: Axis = 0, copy: bool = True, inplace: bool = False) Self | None
Set the name of the axis for the index or columns.
- Parameters:
mapper (scalar, list-like, optional) – Value to set the axis name attribute.
index (scalar, list-like, dict-like or function, optional) –
A scalar, list-like, dict-like or functions transformations to apply to that axis’ values. Note that the
columnsparameter is not allowed if the object is a Series. This parameter only apply for DataFrame type objects.Use either
mapperandaxisto specify the axis to target withmapper, orindexand/orcolumns.columns (scalar, list-like, dict-like or function, optional) –
A scalar, list-like, dict-like or functions transformations to apply to that axis’ values. Note that the
columnsparameter is not allowed if the object is a Series. This parameter only apply for DataFrame type objects.Use either
mapperandaxisto specify the axis to target withmapper, orindexand/orcolumns.axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For Series this parameter is unused and defaults to 0.
copy (bool, default None) –
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = Trueinplace (bool, default False) – Modifies the object directly, instead of creating a new Series or DataFrame.
- Returns:
The same type as the caller or None if
inplace=True.- Return type:
See also
Series.renameAlter Series index labels or name.
DataFrame.renameAlter DataFrame index labels or name.
Index.renameSet new names on index.
Notes
DataFrame.rename_axissupports two calling conventions(index=index_mapper, columns=columns_mapper, ...)(mapper, axis={'index', 'columns'}, ...)
The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter
copyis ignored.The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your intent.
Examples
Series
>>> s = pd.Series(["dog", "cat", "monkey"]) >>> s 0 dog 1 cat 2 monkey dtype: object >>> s.rename_axis("animal") animal 0 dog 1 cat 2 monkey dtype: object
DataFrame
>>> df = pd.DataFrame({"num_legs": [4, 4, 2], ... "num_arms": [0, 0, 2]}, ... ["dog", "cat", "monkey"]) >>> df num_legs num_arms dog 4 0 cat 4 0 monkey 2 2 >>> df = df.rename_axis("animal") >>> df num_legs num_arms animal dog 4 0 cat 4 0 monkey 2 2 >>> df = df.rename_axis("limbs", axis="columns") >>> df limbs num_legs num_arms animal dog 4 0 cat 4 0 monkey 2 2
MultiIndex
>>> df.index = pd.MultiIndex.from_product([['mammal'], ... ['dog', 'cat', 'monkey']], ... names=['type', 'name']) >>> df limbs num_legs num_arms type name mammal dog 4 0 cat 4 0 monkey 2 2
>>> df.rename_axis(index={'type': 'class'}) limbs num_legs num_arms class name mammal dog 4 0 cat 4 0 monkey 2 2
>>> df.rename_axis(columns=str.upper) LIMBS num_legs num_arms type name mammal dog 4 0 cat 4 0 monkey 2 2
- reorder_levels(order: Sequence[Level]) Series
Rearrange index levels using input order.
May not drop or duplicate levels.
- Parameters:
order (list of int representing new level order) – Reference level by number or key.
- Return type:
type of caller (new object)
Examples
>>> arrays = [np.array(["dog", "dog", "cat", "cat", "bird", "bird"]), ... np.array(["white", "black", "white", "black", "white", "black"])] >>> s = pd.Series([1, 2, 3, 3, 5, 2], index=arrays) >>> s dog white 1 black 2 cat white 3 black 3 bird white 5 black 2 dtype: int64 >>> s.reorder_levels([1, 0]) white dog 1 black dog 2 white cat 3 black cat 3 white bird 5 black bird 2 dtype: int64
- repeat(repeats: int | Sequence[int], axis: None = None) Series
Repeat elements of a Series.
Returns a new Series where each element of the current Series is repeated consecutively a given number of times.
- Parameters:
repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty Series.
axis (None) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
Newly created Series with repeated elements.
- Return type:
See also
Index.repeatEquivalent function for Index.
numpy.repeatSimilar method for
numpy.ndarray.
Examples
>>> s = pd.Series(['a', 'b', 'c']) >>> s 0 a 1 b 2 c dtype: object >>> s.repeat(2) 0 a 0 a 1 b 1 b 2 c 2 c dtype: object >>> s.repeat([1, 2, 3]) 0 a 1 b 1 b 2 c 2 c 2 c dtype: object
- reset_index(level: IndexLabel = None, *, drop: Literal[False] = False, name: Level = <no_default>, inplace: Literal[False] = False, allow_duplicates: bool = False) DataFrame
- reset_index(level: IndexLabel = None, *, drop: Literal[True], name: Level = <no_default>, inplace: Literal[False] = False, allow_duplicates: bool = False) Series
- reset_index(level: IndexLabel = None, *, drop: bool = False, name: Level = <no_default>, inplace: Literal[True], allow_duplicates: bool = False) None
Generate a new DataFrame or Series with the index reset.
This is useful when the index needs to be treated as a column, or when the index is meaningless and needs to be reset to the default before another operation.
- Parameters:
level (int, str, tuple, or list, default optional) – For a Series with a MultiIndex, only remove the specified levels from the index. Removes all levels by default.
drop (bool, default False) – Just reset the index, without inserting it as a column in the new DataFrame.
name (object, optional) – The name to use for the column containing the original Series values. Uses
self.nameby default. This argument is ignored when drop is True.inplace (bool, default False) – Modify the Series in place (do not create a new object).
allow_duplicates (bool, default False) –
Allow duplicate column labels to be created.
Added in version 1.5.0.
- Returns:
When drop is False (the default), a DataFrame is returned. The newly created columns will come first in the DataFrame, followed by the original Series values. When drop is True, a Series is returned. In either case, if
inplace=True, no value is returned.- Return type:
See also
DataFrame.reset_indexAnalogous function for DataFrame.
Examples
>>> s = pd.Series([1, 2, 3, 4], name='foo', ... index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
Generate a DataFrame with default index.
>>> s.reset_index() idx foo 0 a 1 1 b 2 2 c 3 3 d 4
To specify the name of the new column use name.
>>> s.reset_index(name='values') idx values 0 a 1 1 b 2 2 c 3 3 d 4
To generate a new Series with the default set drop to True.
>>> s.reset_index(drop=True) 0 1 1 2 2 3 3 4 Name: foo, dtype: int64
The level parameter is interesting for Series with a multi-level index.
>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']), ... np.array(['one', 'two', 'one', 'two'])] >>> s2 = pd.Series( ... range(4), name='foo', ... index=pd.MultiIndex.from_arrays(arrays, ... names=['a', 'b']))
To remove a specific level from the Index, use level.
>>> s2.reset_index(level='a') a foo b one bar 0 two bar 1 one baz 2 two baz 3
If level is not set, all levels are removed from the Index.
>>> s2.reset_index() a b foo 0 bar one 0 1 bar two 1 2 baz one 2 3 baz two 3
- rfloordiv(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Integer division of series and other, element-wise (binary operator rfloordiv).
Equivalent to
other // series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.floordivElement-wise Integer division, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.floordiv(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- rmod(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Modulo of series and other, element-wise (binary operator rmod).
Equivalent to
other % series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.modElement-wise Modulo, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.mod(b, fill_value=0) a 0.0 b NaN c NaN d 0.0 e NaN dtype: float64
- rmul(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Multiplication of series and other, element-wise (binary operator rmul).
Equivalent to
other * series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.mulElement-wise Multiplication, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.multiply(b, fill_value=0) a 1.0 b 0.0 c 0.0 d 0.0 e NaN dtype: float64
- round(decimals: int = 0, *args, **kwargs) Series
Round each value in a Series to the given number of decimals.
- Parameters:
decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.
*args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.
- Returns:
Rounded values of the Series.
- Return type:
See also
numpy.aroundRound values of an np.array.
DataFrame.roundRound values of a DataFrame.
Examples
>>> s = pd.Series([0.1, 1.3, 2.7]) >>> s.round() 0 0.0 1 1.0 2 3.0 dtype: float64
- rpow(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Exponential power of series and other, element-wise (binary operator rpow).
Equivalent to
other ** series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.powElement-wise Exponential power, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.pow(b, fill_value=0) a 1.0 b 1.0 c 1.0 d 0.0 e NaN dtype: float64
- rsub(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Subtraction of series and other, element-wise (binary operator rsub).
Equivalent to
other - series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.subElement-wise Subtraction, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
- rtruediv(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to
other / series, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.truedivElement-wise Floating division, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- searchsorted(value: NumpyValueArrayLike | ExtensionArray, side: Literal['left', 'right'] = 'left', sorter: NumpySorter | None = None) npt.NDArray[np.intp] | np.intp
Find indices where elements should be inserted to maintain order.
Find the indices into a sorted Series self such that, if the corresponding elements in value were inserted before the indices, the order of self would be preserved.
Note
The Series must be monotonically sorted, otherwise wrong locations will likely be returned. Pandas does not check this for you.
- Parameters:
value (array-like or scalar) – Values to insert into self.
side ({'left', 'right'}, optional) – If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of self).
sorter (1-D array-like, optional) – Optional array of integer indices that sort self into ascending order. They are typically the result of
np.argsort.
- Returns:
A scalar or array of insertion points with the same shape as value.
- Return type:
int or array of int
See also
sort_valuesSort by the values along either axis.
numpy.searchsortedSimilar method from NumPy.
Notes
Binary search is used to find the required insertion points.
Examples
>>> ser = pd.Series([1, 2, 3]) >>> ser 0 1 1 2 2 3 dtype: int64
>>> ser.searchsorted(4) 3
>>> ser.searchsorted([0, 4]) array([0, 3])
>>> ser.searchsorted([1, 3], side='left') array([0, 2])
>>> ser.searchsorted([1, 3], side='right') array([1, 3])
>>> ser = pd.Series(pd.to_datetime(['3/11/2000', '3/12/2000', '3/13/2000'])) >>> ser 0 2000-03-11 1 2000-03-12 2 2000-03-13 dtype: datetime64[ns]
>>> ser.searchsorted('3/14/2000') 3
>>> ser = pd.Categorical( ... ['apple', 'bread', 'bread', 'cheese', 'milk'], ordered=True ... ) >>> ser ['apple', 'bread', 'bread', 'cheese', 'milk'] Categories (4, object): ['apple' < 'bread' < 'cheese' < 'milk']
>>> ser.searchsorted('bread') 1
>>> ser.searchsorted(['bread'], side='right') array([3])
If the values are not monotonically sorted, wrong locations may be returned:
>>> ser = pd.Series([2, 1, 3]) >>> ser 0 2 1 1 2 3 dtype: int64
>>> ser.searchsorted(1) 0 # wrong result, correct would be 1
- sem(axis: Axis | None = None, skipna: bool = True, ddof: int = 1, numeric_only: bool = False, **kwargs)
Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
- Parameters:
axis ({index (0)}) –
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sem with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.sem().round(6) 0.57735
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]}, index=['tiger', 'zebra']) >>> df a b tiger 1 2 zebra 2 3 >>> df.sem() a 0.5 b 0.5 dtype: float64
Using axis=1
>>> df.sem(axis=1) tiger 0.5 zebra 0.5 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['T', 'Z']}, ... index=['tiger', 'zebra']) >>> df.sem(numeric_only=True) a 0.5 dtype: float64
- Return type:
scalar or Series (if level specified)
- set_axis(labels, *, axis: Axis = 0, copy: bool | None = None) Series
Assign desired index to given axis.
Indexes for row labels can be changed by assigning a list-like or Index.
- Parameters:
labels (list-like, Index) – The values for the new index.
axis ({0 or 'index'}, default 0) – The axis to update. The value 0 identifies the rows. For Series this parameter is unused and defaults to 0.
copy (bool, default True) –
Whether to make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
An object of type Series.
- Return type:
See also
Series.rename_axisAlter the name of the index. Examples ——– >>> s = pd.Series([1, 2, 3]) >>> s 0 1 1 2 2 3
dtypeint64 >>> s.set_axis([‘a’, ‘b’, ‘c’], axis=0) a 1 b 2 c 3
dtypeint64
- skew(axis: Axis | None = 0, skipna: bool = True, numeric_only: bool = False, **kwargs)
Return unbiased skew over requested axis.
Normalized by N-1.
- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
For DataFrames, specifying
axis=Nonewill apply the aggregation across both axes.Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns:
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.skew() 0.0
With a DataFrame
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [1, 3, 5]}, ... index=['tiger', 'zebra', 'cow']) >>> df a b c tiger 1 2 1 zebra 2 3 3 cow 3 4 5 >>> df.skew() a 0.0 b 0.0 c 0.0 dtype: float64
Using axis=1
>>> df.skew(axis=1) tiger 1.732051 zebra -1.732051 cow 0.000000 dtype: float64
In this case, numeric_only should be set to True to avoid getting an error.
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': ['T', 'Z', 'X']}, ... index=['tiger', 'zebra', 'cow']) >>> df.skew(numeric_only=True) a 0.0 dtype: float64
- Return type:
scalar or scalar
- sort_index(*, axis: Axis = 0, level: IndexLabel = None, ascending: bool | Sequence[bool] = True, inplace: Literal[True], kind: SortKind = 'quicksort', na_position: NaPosition = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: IndexKeyFunc = None) None
- sort_index(*, axis: Axis = 0, level: IndexLabel = None, ascending: bool | Sequence[bool] = True, inplace: Literal[False] = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: IndexKeyFunc = None) Series
- sort_index(*, axis: Axis = 0, level: IndexLabel = None, ascending: bool | Sequence[bool] = True, inplace: bool = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: IndexKeyFunc = None) Series | None
Sort Series by index labels.
Returns a new Series sorted by label if inplace argument is
False, otherwise updates the original series and returns None.- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
level (int, optional) – If not None, sort on values in specified index level(s).
ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort()for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.na_position ({'first', 'last'}, default 'last') – If ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) – If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin
sorted()function, with the notable difference that this key function should be vectorized. It should expect anIndexand return anIndexof the same shape.
- Returns:
The original Series sorted by the labels or None if
inplace=True.- Return type:
Series or None
See also
DataFrame.sort_indexSort DataFrame by the index.
DataFrame.sort_valuesSort DataFrame by the value.
Series.sort_valuesSort Series by the value.
Examples
>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4]) >>> s.sort_index() 1 c 2 b 3 a 4 d dtype: object
Sort Descending
>>> s.sort_index(ascending=False) 4 d 3 a 2 b 1 c dtype: object
By default NaNs are put at the end, but use na_position to place them at the beginning
>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, np.nan]) >>> s.sort_index(na_position='first') NaN d 1.0 c 2.0 b 3.0 a dtype: object
Specify index level to sort
>>> arrays = [np.array(['qux', 'qux', 'foo', 'foo', ... 'baz', 'baz', 'bar', 'bar']), ... np.array(['two', 'one', 'two', 'one', ... 'two', 'one', 'two', 'one'])] >>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=arrays) >>> s.sort_index(level=1) bar one 8 baz one 6 foo one 4 qux one 2 bar two 7 baz two 5 foo two 3 qux two 1 dtype: int64
Does not sort by remaining levels when sorting by levels
>>> s.sort_index(level=1, sort_remaining=False) qux one 2 foo one 4 baz one 6 bar one 8 qux two 1 foo two 3 baz two 5 bar two 7 dtype: int64
Apply a key function before sorting
>>> s = pd.Series([1, 2, 3, 4], index=['A', 'b', 'C', 'd']) >>> s.sort_index(key=lambda x : x.str.lower()) A 1 b 2 C 3 d 4 dtype: int64
- sort_values(*, axis: Axis = 0, ascending: bool | Sequence[bool] = True, inplace: Literal[False] = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', ignore_index: bool = False, key: ValueKeyFunc = None) Series
- sort_values(*, axis: Axis = 0, ascending: bool | Sequence[bool] = True, inplace: Literal[True], kind: SortKind = 'quicksort', na_position: NaPosition = 'last', ignore_index: bool = False, key: ValueKeyFunc = None) None
- sort_values(*, axis: Axis = 0, ascending: bool | Sequence[bool] = True, inplace: bool = False, kind: SortKind = 'quicksort', na_position: NaPosition = 'last', ignore_index: bool = False, key: ValueKeyFunc = None) Series | None
Sort by the values.
Sort a Series in ascending or descending order by some criterion.
- Parameters:
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
ascending (bool or list of bools, default True) – If True, sort values in ascending order, otherwise descending.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also
numpy.sort()for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms.na_position ({'first' or 'last'}, default 'last') – Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
key (callable, optional) – If not None, apply the key function to the series values before sorting. This is similar to the key argument in the builtin
sorted()function, with the notable difference that this key function should be vectorized. It should expect aSeriesand return an array-like.
- Returns:
Series ordered by values or None if
inplace=True.- Return type:
Series or None
See also
Series.sort_indexSort by the Series indices.
DataFrame.sort_valuesSort DataFrame by the values along either axis.
DataFrame.sort_indexSort DataFrame by indices.
Examples
>>> s = pd.Series([np.nan, 1, 3, 10, 5]) >>> s 0 NaN 1 1.0 2 3.0 3 10.0 4 5.0 dtype: float64
Sort values ascending order (default behaviour)
>>> s.sort_values(ascending=True) 1 1.0 2 3.0 4 5.0 3 10.0 0 NaN dtype: float64
Sort values descending order
>>> s.sort_values(ascending=False) 3 10.0 4 5.0 2 3.0 1 1.0 0 NaN dtype: float64
Sort values putting NAs first
>>> s.sort_values(na_position='first') 0 NaN 1 1.0 2 3.0 4 5.0 3 10.0 dtype: float64
Sort a series of strings
>>> s = pd.Series(['z', 'b', 'd', 'a', 'c']) >>> s 0 z 1 b 2 d 3 a 4 c dtype: object
>>> s.sort_values() 3 a 1 b 4 c 2 d 0 z dtype: object
Sort using a key function. Your key function will be given the
Seriesof values and should return an array-like.>>> s = pd.Series(['a', 'B', 'c', 'D', 'e']) >>> s.sort_values() 1 B 3 D 0 a 2 c 4 e dtype: object >>> s.sort_values(key=lambda x: x.str.lower()) 0 a 1 B 2 c 3 D 4 e dtype: object
NumPy ufuncs work well here. For example, we can sort by the
sinof the value>>> s = pd.Series([-4, -2, 0, 2, 4]) >>> s.sort_values(key=np.sin) 1 -2 4 4 2 0 0 -4 3 2 dtype: int64
More complicated user-defined functions can be used, as long as they expect a Series and return an array-like
>>> s.sort_values(key=lambda x: (np.tan(x.cumsum()))) 0 -4 3 2 4 4 1 -2 2 0 dtype: int64
- sparse
alias of
SparseAccessor
- std(axis: Axis | None = None, skipna: bool = True, ddof: int = 1, numeric_only: bool = False, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0)}) –
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.std with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
scalar or Series (if level specified)
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
The standard deviation of the columns can be found as follows:
>>> df.std() age 18.786076 height 0.237417 dtype: float64
Alternatively, ddof=0 can be set to normalize by N instead of N-1:
>>> df.std(ddof=0) age 16.269219 height 0.205609 dtype: float64
- str
alias of
StringMethods
- struct
alias of
StructAccessor
- sub(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Subtraction of series and other, element-wise (binary operator sub).
Equivalent to
series - other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rsubReverse of the Subtraction operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
- subtract(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Subtraction of series and other, element-wise (binary operator sub).
Equivalent to
series - other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rsubReverse of the Subtraction operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.subtract(b, fill_value=0) a 0.0 b 1.0 c 1.0 d -1.0 e NaN dtype: float64
- sum(axis: Axis | None = None, skipna: bool = True, numeric_only: bool = False, min_count: int = 0, **kwargs)
Return the sum of the values over the requested axis.
This is equivalent to the method
numpy.sum.- Parameters:
axis ({index (0)}) –
Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sum with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).Added in version 2.0.0.
skipna (bool, default True) – Exclude NA/null values when computing the result.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than
min_countnon-NA values are present the result will be NA.**kwargs – Additional keyword arguments to be passed to the function.
- Return type:
scalar or scalar
See also
Series.sumReturn the sum.
Series.minReturn the minimum.
Series.maxReturn the maximum.
Series.idxminReturn the index of the minimum.
Series.idxmaxReturn the index of the maximum.
DataFrame.sumReturn the sum over the requested axis.
DataFrame.minReturn the minimum over the requested axis.
DataFrame.maxReturn the maximum over the requested axis.
DataFrame.idxminReturn the index of the minimum over the requested axis.
DataFrame.idxmaxReturn the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([ ... ['warm', 'warm', 'cold', 'cold'], ... ['dog', 'falcon', 'fish', 'spider']], ... names=['blooded', 'animal']) >>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx) >>> s blooded animal warm dog 4 falcon 2 cold fish 0 spider 8 Name: legs, dtype: int64
>>> s.sum() 14
By default, the sum of an empty or all-NA Series is
0.>>> pd.Series([], dtype="float64").sum() # min_count=0 is the default 0.0
This can be controlled with the
min_countparameter. For example, if you’d like the sum of an empty series to be NaN, passmin_count=1.>>> pd.Series([], dtype="float64").sum(min_count=1) nan
Thanks to the
skipnaparameter,min_counthandles all-NA and empty series identically.>>> pd.Series([np.nan]).sum() 0.0
>>> pd.Series([np.nan]).sum(min_count=1) nan
- swaplevel(i: Level = -2, j: Level = -1, copy: bool | None = None) Series
Swap levels i and j in a
MultiIndex.Default is to swap the two innermost levels of the index.
- Parameters:
i (int or str) – Levels of the indices to be swapped. Can pass level name as string.
j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
copy (bool, default True) –
Whether to copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
Series with levels swapped in MultiIndex.
- Return type:
Examples
>>> s = pd.Series( ... ["A", "B", "A", "C"], ... index=[ ... ["Final exam", "Final exam", "Coursework", "Coursework"], ... ["History", "Geography", "History", "Geography"], ... ["January", "February", "March", "April"], ... ], ... ) >>> s Final exam History January A Geography February B Coursework History March A Geography April C dtype: object
In the following example, we will swap the levels of the indices. Here, we will swap the levels column-wise, but levels can be swapped row-wise in a similar manner. Note that column-wise is the default behaviour. By not supplying any arguments for i and j, we swap the last and second to last indices.
>>> s.swaplevel() Final exam January History A February Geography B Coursework March History A April Geography C dtype: object
By supplying one argument, we can choose which index to swap the last index with. We can for example swap the first index with the last one as follows.
>>> s.swaplevel(0) January History Final exam A February Geography Final exam B March History Coursework A April Geography Coursework C dtype: object
We can also define explicitly which indices we want to swap by supplying values for both i and j. Here, we for example swap the first and second indices.
>>> s.swaplevel(0, 1) History Final exam January A Geography Final exam February B History Coursework March A Geography Coursework April C dtype: object
- to_dict(*, into: type[MutableMappingT] | MutableMappingT) MutableMappingT
- to_dict(*, into: type[dict] = <class 'dict'>) dict
Convert Series to {label -> value} dict or dict-like object.
- Parameters:
into (class, default dict) – The collections.abc.MutableMapping subclass to use as the return object. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
- Returns:
Key-value representation of Series.
- Return type:
collections.abc.MutableMapping
Examples
>>> s = pd.Series([1, 2, 3, 4]) >>> s.to_dict() {0: 1, 1: 2, 2: 3, 3: 4} >>> from collections import OrderedDict, defaultdict >>> s.to_dict(into=OrderedDict) OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)]) >>> dd = defaultdict(list) >>> s.to_dict(into=dd) defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
- to_frame(name: Hashable = <no_default>) DataFrame
Convert Series to DataFrame.
- Parameters:
name (object, optional) – The passed name should substitute for the series name (if it has one).
- Returns:
DataFrame representation of Series.
- Return type:
Examples
>>> s = pd.Series(["a", "b", "c"], ... name="vals") >>> s.to_frame() vals 0 a 1 b 2 c
- to_markdown(buf: IO[str] | None = None, mode: str = 'wt', index: bool = True, storage_options: StorageOptions | None = None, **kwargs) str | None
Print Series in Markdown-friendly format.
- Parameters:
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) – Add index (row) labels.
storage_options (dict, optional) –
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to
urllib.request.Requestas header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded tofsspec.open. Please seefsspecandurllibfor more details, and for more examples on storage options refer here.**kwargs –
These parameters will be passed to tabulate.
- Returns:
Series in Markdown-friendly format.
- Return type:
str
Notes
Requires the tabulate package.
- Examples
>>> s = pd.Series(["elk", "pig", "dog", "quetzal"], name="animal") >>> print(s.to_markdown()) | | animal | |---:|:---------| | 0 | elk | | 1 | pig | | 2 | dog | | 3 | quetzal |
Output markdown with a tabulate option.
>>> print(s.to_markdown(tablefmt="grid")) +----+----------+ | | animal | +====+==========+ | 0 | elk | +----+----------+ | 1 | pig | +----+----------+ | 2 | dog | +----+----------+ | 3 | quetzal | +----+----------+
- to_period(freq: str | None = None, copy: bool | None = None) Series
Convert Series from DatetimeIndex to PeriodIndex.
- Parameters:
freq (str, default None) – Frequency associated with the PeriodIndex.
copy (bool, default True) –
Whether or not to return a copy.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Returns:
Series with index converted to PeriodIndex.
- Return type:
Examples
>>> idx = pd.DatetimeIndex(['2023', '2024', '2025']) >>> s = pd.Series([1, 2, 3], index=idx) >>> s = s.to_period() >>> s 2023 1 2024 2 2025 3 Freq: Y-DEC, dtype: int64
Viewing the index
>>> s.index PeriodIndex(['2023', '2024', '2025'], dtype='period[Y-DEC]')
- to_string(buf: None = None, na_rep: str = 'NaN', float_format: str | None = None, header: bool = True, index: bool = True, length: bool = False, dtype=False, name=False, max_rows: int | None = None, min_rows: int | None = None) str
- to_string(buf: FilePath | WriteBuffer[str], na_rep: str = 'NaN', float_format: str | None = None, header: bool = True, index: bool = True, length: bool = False, dtype=False, name=False, max_rows: int | None = None, min_rows: int | None = None) None
Render a string representation of the Series.
- Parameters:
buf (StringIO-like, optional) – Buffer to write to.
na_rep (str, optional) – String representation of NaN to use, default ‘NaN’.
float_format (one-parameter function, optional) – Formatter function to apply to columns’ elements if they are floats, default None.
header (bool, default True) – Add the Series header (index name).
index (bool, optional) – Add index (row) labels, default True.
length (bool, default False) – Add the Series length.
dtype (bool, default False) – Add the Series dtype.
name (bool, default False) – Add the Series name if not None.
max_rows (int, optional) – Maximum number of rows to show before truncating. If None, show all.
min_rows (int, optional) – The number of rows to display in a truncated repr (when number of rows is above max_rows).
- Returns:
String representation of Series if
buf=None, otherwise None.- Return type:
str or None
Examples
>>> ser = pd.Series([1, 2, 3]).to_string() >>> ser '0 1\n1 2\n2 3'
- to_timestamp(freq: Frequency | None = None, how: Literal['s', 'e', 'start', 'end'] = 'start', copy: bool | None = None) Series
Cast to DatetimeIndex of Timestamps, at beginning of period.
- Parameters:
freq (str, default frequency of PeriodIndex) – Desired frequency.
how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.
copy (bool, default True) –
Whether or not to return a copy.
Note
The copy keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a copy keyword will use a lazy copy mechanism to defer the copy and ignore the copy keyword. The copy keyword will be removed in a future version of pandas.
You can already get the future behavior and improvements through enabling copy on write
pd.options.mode.copy_on_write = True
- Return type:
Series with DatetimeIndex
Examples
>>> idx = pd.PeriodIndex(['2023', '2024', '2025'], freq='Y') >>> s1 = pd.Series([1, 2, 3], index=idx) >>> s1 2023 1 2024 2 2025 3 Freq: Y-DEC, dtype: int64
The resulting frequency of the Timestamps is YearBegin
>>> s1 = s1.to_timestamp() >>> s1 2023-01-01 1 2024-01-01 2 2025-01-01 3 Freq: YS-JAN, dtype: int64
Using freq which is the offset that the Timestamps will have
>>> s2 = pd.Series([1, 2, 3], index=idx) >>> s2 = s2.to_timestamp(freq='M') >>> s2 2023-01-31 1 2024-01-31 2 2025-01-31 3 Freq: YE-JAN, dtype: int64
- transform(func: AggFuncType, axis: Axis = 0, *args, **kwargs) DataFrame | Series
Call
funcon self producing a Series with the same axis shape as self.- Parameters:
func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a Series or when passed to Series.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g.
[np.exp, 'sqrt']dict-like of axis labels -> functions, function names or list-like of such.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns:
A Series that must have the same length as self.
- Return type:
:raises ValueError : If the returned Series has a different length than self.:
See also
Series.aggOnly perform aggregating type operations.
Series.applyInvoke function on a Series.
Notes
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)}) >>> df A B 0 0 1 1 1 2 2 2 3 >>> df.transform(lambda x: x + 1) A B 0 1 2 1 2 3 2 3 4
Even though the resulting Series must have the same length as the input Series, it is possible to provide several input functions:
>>> s = pd.Series(range(3)) >>> s 0 0 1 1 2 2 dtype: int64 >>> s.transform([np.sqrt, np.exp]) sqrt exp 0 0.000000 1.000000 1 1.000000 2.718282 2 1.414214 7.389056
You can call transform on a GroupBy object:
>>> df = pd.DataFrame({ ... "Date": [ ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05", ... "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"], ... "Data": [5, 8, 6, 1, 50, 100, 60, 120], ... }) >>> df Date Data 0 2015-05-08 5 1 2015-05-07 8 2 2015-05-06 6 3 2015-05-05 1 4 2015-05-08 50 5 2015-05-07 100 6 2015-05-06 60 7 2015-05-05 120 >>> df.groupby('Date')['Data'].transform('sum') 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data, dtype: int64
>>> df = pd.DataFrame({ ... "c": [1, 1, 1, 2, 2, 2, 2], ... "type": ["m", "n", "o", "m", "m", "n", "n"] ... }) >>> df c type 0 1 m 1 1 n 2 1 o 3 2 m 4 2 m 5 2 n 6 2 n >>> df['size'] = df.groupby('c')['type'].transform(len) >>> df c type size 0 1 m 3 1 1 n 3 2 1 o 3 3 2 m 4 4 2 m 4 5 2 n 4 6 2 n 4
- truediv(other, level=None, fill_value=None, axis: Axis = 0) Series
Return Floating division of series and other, element-wise (binary operator truediv).
Equivalent to
series / other, but with support to substitute a fill_value for missing data in either one of the inputs.- Parameters:
other (Series or scalar value)
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.
axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.
- Returns:
The result of the operation.
- Return type:
See also
Series.rtruedivReverse of the Floating division operator, see Python documentation for more details.
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.divide(b, fill_value=0) a 1.0 b inf c inf d 0.0 e NaN dtype: float64
- unique() ArrayLike
Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
- Returns:
The unique values returned as a NumPy array. See Notes.
- Return type:
ndarray or ExtensionArray
See also
Series.drop_duplicatesReturn Series with duplicate values removed.
uniqueTop-level unique method for any 1-d array-like object.
Index.uniqueReturn Index with unique values from an Index object.
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new
ExtensionArrayof that type with just the unique values is returned. This includesCategorical
Period
Datetime with Timezone
Datetime without Timezone
Timedelta
Interval
Sparse
IntegerNA
See Examples section.
Examples
>>> pd.Series([2, 1, 3, 3], name='A').unique() array([2, 1, 3])
>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique() <DatetimeArray> ['2016-01-01 00:00:00'] Length: 1, dtype: datetime64[ns]
>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern') ... for _ in range(3)]).unique() <DatetimeArray> ['2016-01-01 00:00:00-05:00'] Length: 1, dtype: datetime64[ns, US/Eastern]
An Categorical will return categories in the order of appearance and with the same dtype.
>>> pd.Series(pd.Categorical(list('baabc'))).unique() ['b', 'a', 'c'] Categories (3, object): ['a', 'b', 'c'] >>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'), ... ordered=True)).unique() ['b', 'a', 'c'] Categories (3, object): ['a' < 'b' < 'c']
- unstack(level: IndexLabel = -1, fill_value: Hashable | None = None, sort: bool = True) DataFrame
Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
- Parameters:
level (int, str, or list of these, default last level) – Level(s) to unstack, can pass level name.
fill_value (scalar value, default None) – Value to use when replacing NaN values.
sort (bool, default True) – Sort the level(s) in the resulting MultiIndex columns.
- Returns:
Unstacked Series.
- Return type:
Notes
Reference the user guide for more examples.
Examples
>>> s = pd.Series([1, 2, 3, 4], ... index=pd.MultiIndex.from_product([['one', 'two'], ... ['a', 'b']])) >>> s one a 1 b 2 two a 3 b 4 dtype: int64
>>> s.unstack(level=-1) a b one 1 2 two 3 4
>>> s.unstack(level=0) one two a 1 3 b 2 4
- update(other: Series | Sequence | Mapping) None
Modify Series in place using values from passed Series.
Uses non-NA values from passed Series to make updates. Aligns on index.
- Parameters:
other (Series, or object coercible into Series)
Examples
>>> s = pd.Series([1, 2, 3]) >>> s.update(pd.Series([4, 5, 6])) >>> s 0 4 1 5 2 6 dtype: int64
>>> s = pd.Series(['a', 'b', 'c']) >>> s.update(pd.Series(['d', 'e'], index=[0, 2])) >>> s 0 d 1 b 2 e dtype: object
>>> s = pd.Series([1, 2, 3]) >>> s.update(pd.Series([4, 5, 6, 7, 8])) >>> s 0 4 1 5 2 6 dtype: int64
If
othercontains NaNs the corresponding values are not updated in the original Series.>>> s = pd.Series([1, 2, 3]) >>> s.update(pd.Series([4, np.nan, 6])) >>> s 0 4 1 2 2 6 dtype: int64
othercan also be a non-Series object type that is coercible into a Series>>> s = pd.Series([1, 2, 3]) >>> s.update([4, np.nan, 6]) >>> s 0 4 1 2 2 6 dtype: int64
>>> s = pd.Series([1, 2, 3]) >>> s.update({1: 9}) >>> s 0 1 1 9 2 3 dtype: int64
- property values
Return Series as ndarray or ndarray-like depending on the dtype.
Warning
We recommend using
Series.arrayorSeries.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.- Return type:
numpy.ndarray or ndarray-like
See also
Series.arrayReference to the underlying data.
Series.to_numpyA NumPy array representing the underlying data.
Examples
>>> pd.Series([1, 2, 3]).values array([1, 2, 3])
>>> pd.Series(list('aabc')).values array(['a', 'a', 'b', 'c'], dtype=object)
>>> pd.Series(list('aabc')).astype('category').values ['a', 'a', 'b', 'c'] Categories (3, object): ['a', 'b', 'c']
Timezone aware datetime data is converted to UTC:
>>> pd.Series(pd.date_range('20130101', periods=3, ... tz='US/Eastern')).values array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000', '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
- var(axis: Axis | None = None, skipna: bool = True, ddof: int = 1, numeric_only: bool = False, **kwargs)
Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters:
axis ({index (0)}) –
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.var with
axis=Noneis deprecated, in a future version this will reduce over both axes and return a scalar To retain the old behavior, pass axis=0 (or do not pass axis).skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.
- Return type:
scalar or Series (if level specified)
Examples
>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3], ... 'age': [21, 25, 62, 43], ... 'height': [1.61, 1.87, 1.49, 2.01]} ... ).set_index('person_id') >>> df age height person_id 0 21 1.61 1 25 1.87 2 62 1.49 3 43 2.01
>>> df.var() age 352.916667 height 0.056367 dtype: float64
Alternatively,
ddof=0can be set to normalize by N instead of N-1:>>> df.var(ddof=0) age 264.687500 height 0.042275 dtype: float64
- view(dtype: Dtype | None = None) Series
Create a new view of the Series.
Deprecated since version 2.2.0:
Series.viewis deprecated and will be removed in a future version. UseSeries.astype()as an alternative to change the dtype.This function will return a new Series with a view of the same underlying values in memory, optionally reinterpreted with a new data type. The new data type must preserve the same size in bytes as to not cause index misalignment.
- Parameters:
dtype (data type) – Data type object or one of their string representations.
- Returns:
A new Series object as a view of the same data in memory.
- Return type:
See also
numpy.ndarray.viewEquivalent numpy function to create a new view of the same data in memory.
Notes
Series are instantiated with
dtype=float64by default. Whilenumpy.ndarray.view()will return a view with the same data type as the original array,Series.view()(without specified dtype) will try usingfloat64and may fail if the original data type size in bytes is not the same.Examples
Use
astypeto change the dtype instead.
- class pyranges1.ext.stats.combinations_with_replacement(iterable, r)
Return successive r-length combinations of elements in the iterable allowing individual elements to have successive repeats.
combinations_with_replacement(‘ABC’, 2) –> (‘A’,’A’), (‘A’,’B’), (‘A’,’C’), (‘B’,’B’), (‘B’,’C’), (‘C’,’C’)
- class pyranges1.ext.stats.defaultdict
defaultdict(default_factory=None, /, […]) –> dict with default factory
The default factory is called without arguments to produce a new value when a key is not present, in __getitem__ only. A defaultdict compares equal to a dict with the same items. All remaining arguments are treated the same as if they were passed to the dict constructor, including keyword arguments.
- copy() a shallow copy of D.
- default_factory
Factory for default value called by __missing__().
- pyranges1.ext.stats.ensure_pyranges(df: pd.DataFrame) PyRanges
Ensure df is a PyRanges.
Helps pyright.
- pyranges1.ext.stats.fdr(p_vals: Series) Series
Adjust p-values with Benjamini-Hochberg.
- Parameters:
p_vals (array-like) – P-values to adjust.
- Returns:
DataFrame where values are order of data.
- Return type:
Pandas.DataFrame
Examples
>>> import pyranges1 as pr >>> d = {'Chromosome': ['chr3', 'chr6', 'chr13'], 'Start': [146419383, 39800100, 24537618], 'End': [146419483, 39800200, 24537718], 'Strand': ['-', '+', '-'], 'PValue': [0.0039591368855297175, 0.0037600512992788937, 0.0075061166500909205]} >>> gr = pr.PyRanges(d) >>> gr index | Chromosome Start End Strand PValue int64 | str int64 int64 str float64 ------- --- ------------ --------- --------- -------- ---------- 0 | chr3 146419383 146419483 - 0.00395914 1 | chr6 39800100 39800200 + 0.00376005 2 | chr13 24537618 24537718 - 0.00750612 PyRanges with 3 rows, 5 columns, and 1 index columns. Contains 3 chromosomes and 2 strands.
>>> gr["FDR"] = pr.stats.fdr(gr.PValue) >>> gr index | Chromosome Start End Strand PValue FDR int64 | str int64 int64 str float64 float64 ------- --- ------------ --------- --------- -------- ---------- ---------- 0 | chr3 146419383 146419483 - 0.00395914 0.00593871 1 | chr6 39800100 39800200 + 0.00376005 0.0112802 2 | chr13 24537618 24537718 - 0.00750612 0.00750612 PyRanges with 3 rows, 6 columns, and 1 index columns. Contains 3 chromosomes and 2 strands.
- pyranges1.ext.stats.fisher_exact(tp: Series, fp: Series, fn: Series, tn: Series, pseudocount: int = 0) DataFrame
Fisher’s exact for contingency tables.
Computes the hypotheses two-sided, less and greater at the same time.
The odds-ratio is
- Parameters:
tp (array-like of int) – Top left square of contingency table (true positives).
fp (array-like of int) – Top right square of contingency table (false positives).
fn (array-like of int) – Bottom left square of contingency table (false negatives).
tn (array-like of int) – Bottom right square of contingency table (true negatives).
pseudocount (float, default 0) – Values > 0 allow Odds Ratio to always be a finite number.
Notes
The odds-ratio is computed thusly:
((tp + pseudocount) / (fp + pseudocount)) / ((fn + pseudocount) / (tn + pseudocount))- Returns:
DataFrame with columns odds_ratio and P, PLeft and PRight.
- Return type:
pandas.DataFrame
See also
pr.stats.fdrcorrect for multiple testing
Examples
>>> d = {"TP": [12, 0], "FP": [5, 12], "TN": [29, 10], "FN": [2, 2]} >>> df = pd.DataFrame(d) >>> df TP FP TN FN 0 12 5 29 2 1 0 12 10 2
>>> pr.stats.fisher_exact(df.TP, df.FP, df.TN, df.FN) odds_ratio P PLeft PRight 0 0.165517 0.080269 0.044555 0.994525 1 0.000000 0.000067 0.000034 1.000000
- pyranges1.ext.stats.forbes(p: PyRanges, other: PyRanges, chromsizes: PyRanges | DataFrame | dict[Any, int], strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto') float
Compute Forbes coefficient.
Ratio which represents observed versus expected co-occurence.
Described in
Forbes SA (1907): On the local distribution of certain Illinois fishes: an essay in statistical ecology.- Parameters:
p (PyRanges) – Intervals to compare.
other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to its length.
strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.
- Returns:
Ratio of observed versus expected co-occurence.
- Return type:
float
See also
pyranges.stats.jaccardcompute the jaccard coefficient
Examples
>>> gr, gr2 = pr.example_data.f1, pr.example_data.f2 >>> float(pr.stats.forbes(gr, gr2, chromsizes={"chr1": 10})) 0.8333333333333334
- pyranges1.ext.stats.jaccard(p: PyRanges, other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore'] = 'auto') float
Compute Jaccards coefficient.
Ratio of the intersection and union of two sets.
- Parameters:
p (PyRanges) – Intervals to compare.
other (PyRanges) – Intervals to compare with.
strand_behavior ({"auto", "same", "opposite", "ignore"}, default "auto") – Whether to consider overlaps of intervals on the same strand, the opposite or ignore strand information. The default, “auto”, means use “same” if both PyRanges are stranded (see .strand_valid) otherwise ignore the strand information.
- Returns:
Ratio of the intersection and union of two sets.
- Return type:
float
See also
pyranges.stats.forbescompute the forbes coefficient
Examples
>>> gr, gr2 = pr.example_data.f1, pr.example_data.f2 >>> chromsizes = pr.example_data.chromsizes >>> float(pr.stats.jaccard(gr, gr2)) 0.14285714285714285
- pyranges1.ext.stats.mcc(grs: list[PyRanges], *, labels: list[str], genome: PyRanges | pd.DataFrame | dict[str, int] | None = None, use_strand: bool = False) DataFrame
Compute Matthew’s correlation coefficient for PyRanges overlaps.
- Parameters:
grs (list of PyRanges) – PyRanges to compare.
genome (DataFrame or dict, default None) – Should contain chromosome sizes. By default, end position of the rightmost intervals are used as proxies for the chromosome size, but it is recommended to use a genome.
labels (list of str, default None) – Names to give the PyRanges in the output.
use_strand (bool, default False) – Whether to compute correlations per strand.
Examples
>>> grs = [pr.example_data.aorta, pr.example_data.aorta, pr.example_data.aorta2] >>> mcc = pr.stats.mcc(grs, labels="abc", genome={"chr1": 2100000}) >>> mcc T F TP FP TN FN MCC 0 a a 728 0 2099272 0 1.00000 1 a b 728 0 2099272 0 1.00000 3 a c 457 485 2098787 271 0.55168 2 b a 728 0 2099272 0 1.00000 5 b b 728 0 2099272 0 1.00000 6 b c 457 485 2098787 271 0.55168 4 c a 457 271 2098787 485 0.55168 7 c b 457 271 2098787 485 0.55168 8 c c 942 0 2099058 0 1.00000
To create a symmetric matrix (useful for heatmaps of correlations):
>>> mcc.set_index(["T", "F"]).MCC.unstack().rename_axis(None, axis=0) F a b c a 1.00000 1.00000 0.55168 b 1.00000 1.00000 0.55168 c 0.55168 0.55168 1.00000
- class pyranges1.ext.stats.ndarray
- ndarray(shape, dtype=float, buffer=None, offset=0,
strides=None, order=None)
An array object represents a multidimensional, homogeneous array of fixed-size items. An associated data-type object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)
Arrays should be constructed using array, zeros or empty (refer to the See Also section below). The parameters given here refer to a low-level method (ndarray(…)) for instantiating an array.
For more information, refer to the numpy module and examine the methods and attributes of an array.
- Parameters:
below) ((for the __new__ method; see Notes)
shape (tuple of ints) – Shape of created array.
dtype (data-type, optional) – Any object that can be interpreted as a numpy data type.
buffer (object exposing buffer interface, optional) – Used to fill the array with data.
offset (int, optional) – Offset of array data in buffer.
strides (tuple of ints, optional) – Strides of data in memory.
order ({'C', 'F'}, optional) – Row-major (C-style) or column-major (Fortran-style) order.
- data
The array’s elements, in memory.
- Type:
buffer
- dtype
Describes the format of the elements in the array.
- Type:
dtype object
- flags
Dictionary containing information related to memory use, e.g., ‘C_CONTIGUOUS’, ‘OWNDATA’, ‘WRITEABLE’, etc.
- Type:
dict
- flat
Flattened version of the array as an iterator. The iterator allows assignments, e.g.,
x.flat = 3(See ndarray.flat for assignment examples; TODO).- Type:
numpy.flatiter object
- size
Number of elements in the array.
- Type:
int
- itemsize
The memory use of each array element in bytes.
- Type:
int
- nbytes
The total number of bytes required to store the array data, i.e.,
itemsize * size.- Type:
int
- ndim
The array’s number of dimensions.
- Type:
int
- shape
Shape of the array.
- Type:
tuple of ints
- strides
The step-size required to move from one element to the next in memory. For example, a contiguous
(3, 4)array of typeint16in C-order has strides(8, 2). This implies that to move from element to element in memory requires jumps of 2 bytes. To move from row-to-row, one needs to jump 8 bytes at a time (2 * 4).- Type:
tuple of ints
- ctypes
Class containing properties of the array needed for interaction with ctypes.
- Type:
ctypes object
- base
If the array is a view into another array, that array is its base (unless that array is also a view). The base array is where the array data is actually stored.
- Type:
See also
arrayConstruct an array.
zerosCreate an array, each element of which is zero.
emptyCreate an array, but leave its allocated memory unchanged (i.e., it contains “garbage”).
dtypeCreate a data-type.
numpy.typing.NDArrayAn ndarray alias generic w.r.t. its dtype.type <numpy.dtype.type>.
Notes
There are two modes of creating an array using
__new__:If buffer is None, then only shape, dtype, and order are used.
If buffer is an object exposing the buffer interface, then all keywords are interpreted.
No
__init__method is needed because the array is fully initialized after the__new__method.Examples
These examples illustrate the low-level ndarray constructor. Refer to the See Also section above for easier ways of constructing an ndarray.
First mode, buffer is None:
>>> np.ndarray(shape=(2,2), dtype=float, order='F') array([[0.0e+000, 0.0e+000], # random [ nan, 2.5e-323]])
Second mode:
>>> np.ndarray((2,), buffer=np.array([1,2,3]), ... offset=np.int_().itemsize, ... dtype=int) # offset = 1*itemsize, i.e. skip first element array([2, 3])
- T
View of the transposed array.
Same as
self.transpose().Examples
>>> a = np.array([[1, 2], [3, 4]]) >>> a array([[1, 2], [3, 4]]) >>> a.T array([[1, 3], [2, 4]])
>>> a = np.array([1, 2, 3, 4]) >>> a array([1, 2, 3, 4]) >>> a.T array([1, 2, 3, 4])
See also
- all(axis=None, out=None, keepdims=False, *, where=True)
Returns True if all elements evaluate to True.
Refer to numpy.all for full documentation.
See also
numpy.allequivalent function
- any(axis=None, out=None, keepdims=False, *, where=True)
Returns True if any of the elements of a evaluate to True.
Refer to numpy.any for full documentation.
See also
numpy.anyequivalent function
- argmax(axis=None, out=None, *, keepdims=False)
Return indices of the maximum values along the given axis.
Refer to numpy.argmax for full documentation.
See also
numpy.argmaxequivalent function
- argmin(axis=None, out=None, *, keepdims=False)
Return indices of the minimum values along the given axis.
Refer to numpy.argmin for detailed documentation.
See also
numpy.argminequivalent function
- argpartition(kth, axis=-1, kind='introselect', order=None)
Returns the indices that would partition this array.
Refer to numpy.argpartition for full documentation.
Added in version 1.8.0.
See also
numpy.argpartitionequivalent function
- argsort(axis=-1, kind=None, order=None)
Returns the indices that would sort this array.
Refer to numpy.argsort for full documentation.
See also
numpy.argsortequivalent function
- astype(dtype, order='K', casting='unsafe', subok=True, copy=True)
Copy of the array, cast to a specified type.
- Parameters:
dtype (str or dtype) – Typecode or data-type to which the array is cast.
order ({'C', 'F', 'A', 'K'}, optional) – Controls the memory layout order of the result. ‘C’ means C order, ‘F’ means Fortran order, ‘A’ means ‘F’ order if all the arrays are Fortran contiguous, ‘C’ order otherwise, and ‘K’ means as close to the order the array elements appear in memory as possible. Default is ‘K’.
casting ({'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional) –
Controls what kind of data casting may occur. Defaults to ‘unsafe’ for backwards compatibility.
’no’ means the data types should not be cast at all.
’equiv’ means only byte-order changes are allowed.
’safe’ means only casts which can preserve values are allowed.
’same_kind’ means only safe casts or casts within a kind, like float64 to float32, are allowed.
’unsafe’ means any data conversions may be done.
subok (bool, optional) – If True, then sub-classes will be passed-through (default), otherwise the returned array will be forced to be a base-class array.
copy (bool, optional) – By default, astype always returns a newly allocated array. If this is set to false, and the dtype, order, and subok requirements are satisfied, the input array is returned instead of a copy.
- Returns:
arr_t – Unless copy is False and the other conditions for returning the input array are satisfied (see description for copy input parameter), arr_t is a new array of the same shape as the input array, with dtype, order given by dtype, order.
- Return type:
Notes
Changed in version 1.17.0: Casting between a simple data type and a structured one is possible only for “unsafe” casting. Casting to multiple fields is allowed, but casting from multiple fields is not.
Changed in version 1.9.0: Casting from numeric to string types in ‘safe’ casting mode requires that the string dtype length is long enough to store the max integer/float value converted.
- Raises:
ComplexWarning – When casting from complex to float or int. To avoid this, one should use
a.real.astype(t).
Examples
>>> x = np.array([1, 2, 2.5]) >>> x array([1. , 2. , 2.5])
>>> x.astype(int) array([1, 2, 2])
- base
Base object if memory is from some other object.
Examples
The base of an array that owns its memory is None:
>>> x = np.array([1,2,3,4]) >>> x.base is None True
Slicing creates a view, whose memory is shared with x:
>>> y = x[2:] >>> y.base is x True
- byteswap(inplace=False)
Swap the bytes of the array elements
Toggle between low-endian and big-endian data representation by returning a byteswapped array, optionally swapped in-place. Arrays of byte-strings are not swapped. The real and imaginary parts of a complex number are swapped individually.
- Parameters:
inplace (bool, optional) – If
True, swap bytes in-place, default isFalse.- Returns:
out – The byteswapped array. If inplace is
True, this is a view to self.- Return type:
Examples
>>> A = np.array([1, 256, 8755], dtype=np.int16) >>> list(map(hex, A)) ['0x1', '0x100', '0x2233'] >>> A.byteswap(inplace=True) array([ 256, 1, 13090], dtype=int16) >>> list(map(hex, A)) ['0x100', '0x1', '0x3322']
Arrays of byte-strings are not swapped
>>> A = np.array([b'ceg', b'fac']) >>> A.byteswap() array([b'ceg', b'fac'], dtype='|S3')
A.newbyteorder().byteswap()produces an array with the same valuesbut different representation in memory
>>> A = np.array([1, 2, 3]) >>> A.view(np.uint8) array([1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0], dtype=uint8) >>> A.newbyteorder().byteswap(inplace=True) array([1, 2, 3]) >>> A.view(np.uint8) array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3], dtype=uint8)
- choose(choices, out=None, mode='raise')
Use an index array to construct a new array from a set of choices.
Refer to numpy.choose for full documentation.
See also
numpy.chooseequivalent function
- clip(min=None, max=None, out=None, **kwargs)
Return an array whose values are limited to
[min, max]. One of max or min must be given.Refer to numpy.clip for full documentation.
See also
numpy.clipequivalent function
- compress(condition, axis=None, out=None)
Return selected slices of this array along given axis.
Refer to numpy.compress for full documentation.
See also
numpy.compressequivalent function
- conj()
Complex-conjugate all elements.
Refer to numpy.conjugate for full documentation.
See also
numpy.conjugateequivalent function
- conjugate()
Return the complex conjugate, element-wise.
Refer to numpy.conjugate for full documentation.
See also
numpy.conjugateequivalent function
- copy(order='C')
Return a copy of the array.
- Parameters:
order ({'C', 'F', 'A', 'K'}, optional) – Controls the memory layout of the copy. ‘C’ means C-order, ‘F’ means F-order, ‘A’ means ‘F’ if a is Fortran contiguous, ‘C’ otherwise. ‘K’ means match the layout of a as closely as possible. (Note that this function and
numpy.copy()are very similar but have different default values for their order= arguments, and this function always passes sub-classes through.)
See also
numpy.copySimilar function with different default behavior
numpy.copytoNotes
This function is the preferred method for creating an array copy. The function
numpy.copy()is similar, but it defaults to using order ‘K’, and will not pass sub-classes through by default.Examples
>>> x = np.array([[1,2,3],[4,5,6]], order='F')
>>> y = x.copy()
>>> x.fill(0)
>>> x array([[0, 0, 0], [0, 0, 0]])
>>> y array([[1, 2, 3], [4, 5, 6]])
>>> y.flags['C_CONTIGUOUS'] True
- ctypes
An object to simplify the interaction of the array with the ctypes module.
This attribute creates an object that makes it easier to use arrays when calling shared libraries with the ctypes module. The returned object has, among others, data, shape, and strides attributes (see Notes below) which themselves return ctypes objects that can be used as arguments to a shared library.
- Parameters:
None
- Returns:
c – Possessing attributes data, shape, strides, etc.
- Return type:
Python object
See also
numpy.ctypeslibNotes
Below are the public attributes of this object which were documented in “Guide to NumPy” (we have omitted undocumented public attributes, as well as documented private attributes):
- _ctypes.data
A pointer to the memory area of the array as a Python integer. This memory area may contain data that is not aligned, or not in correct byte-order. The memory area may not even be writeable. The array flags and data-type of this array should be respected when passing this attribute to arbitrary C-code to avoid trouble that can include Python crashing. User Beware! The value of this attribute is exactly the same as
self._array_interface_['data'][0].Note that unlike
data_as, a reference will not be kept to the array: code likectypes.c_void_p((a + b).ctypes.data)will result in a pointer to a deallocated array, and should be spelt(a + b).ctypes.data_as(ctypes.c_void_p)
- _ctypes.shape
A ctypes array of length self.ndim where the basetype is the C-integer corresponding to
dtype('p')on this platform (see ~numpy.ctypeslib.c_intp). This base-type could be ctypes.c_int, ctypes.c_long, or ctypes.c_longlong depending on the platform. The ctypes array contains the shape of the underlying array.- Type:
(c_intp*self.ndim)
- _ctypes.strides
A ctypes array of length self.ndim where the basetype is the same as for the shape attribute. This ctypes array contains the strides information from the underlying array. This strides information is important for showing how many bytes must be jumped to get to the next element in the array.
- Type:
(c_intp*self.ndim)
- _ctypes.data_as(obj)
Return the data pointer cast to a particular c-types object. For example, calling
self._as_parameter_is equivalent toself.data_as(ctypes.c_void_p). Perhaps you want to use the data as a pointer to a ctypes array of floating-point data:self.data_as(ctypes.POINTER(ctypes.c_double)).The returned pointer will keep a reference to the array.
- _ctypes.shape_as(obj)
Return the shape tuple as an array of some other c-types type. For example:
self.shape_as(ctypes.c_short).
- _ctypes.strides_as(obj)
Return the strides tuple as an array of some other c-types type. For example:
self.strides_as(ctypes.c_longlong).
If the ctypes module is not available, then the ctypes attribute of array objects still returns something useful, but ctypes objects are not returned and errors may be raised instead. In particular, the object will still have the
as_parameterattribute which will return an integer equal to the data attribute.Examples
>>> import ctypes >>> x = np.array([[0, 1], [2, 3]], dtype=np.int32) >>> x array([[0, 1], [2, 3]], dtype=int32) >>> x.ctypes.data 31962608 # may vary >>> x.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)) <__main__.LP_c_uint object at 0x7ff2fc1fc200> # may vary >>> x.ctypes.data_as(ctypes.POINTER(ctypes.c_uint32)).contents c_uint(0) >>> x.ctypes.data_as(ctypes.POINTER(ctypes.c_uint64)).contents c_ulong(4294967296) >>> x.ctypes.shape <numpy.core._internal.c_long_Array_2 object at 0x7ff2fc1fce60> # may vary >>> x.ctypes.strides <numpy.core._internal.c_long_Array_2 object at 0x7ff2fc1ff320> # may vary
- cumprod(axis=None, dtype=None, out=None)
Return the cumulative product of the elements along the given axis.
Refer to numpy.cumprod for full documentation.
See also
numpy.cumprodequivalent function
- cumsum(axis=None, dtype=None, out=None)
Return the cumulative sum of the elements along the given axis.
Refer to numpy.cumsum for full documentation.
See also
numpy.cumsumequivalent function
- data
Python buffer object pointing to the start of the array’s data.
- diagonal(offset=0, axis1=0, axis2=1)
Return specified diagonals. In NumPy 1.9 the returned array is a read-only view instead of a copy as in previous NumPy versions. In a future version the read-only restriction will be removed.
Refer to
numpy.diagonal()for full documentation.See also
numpy.diagonalequivalent function
- dtype
Data-type of the array’s elements.
Warning
Setting
arr.dtypeis discouraged and may be deprecated in the future. Setting will replace thedtypewithout modifying the memory (see also ndarray.view and ndarray.astype).- Parameters:
None
- Returns:
d
- Return type:
numpy dtype object
See also
ndarray.astypeCast the values contained in the array to a new data-type.
ndarray.viewCreate a view of the same data but a different data-type.
numpy.dtypeExamples
>>> x array([[0, 1], [2, 3]]) >>> x.dtype dtype('int32') >>> type(x.dtype) <type 'numpy.dtype'>
- dump(file)
Dump a pickle of the array to the specified file. The array can be read back with pickle.load or numpy.load.
- Parameters:
file (str or Path) –
A string naming the dump file.
Changed in version 1.17.0: pathlib.Path objects are now accepted.
- dumps()
Returns the pickle of the array as a string. pickle.loads will convert the string back to an array.
- Parameters:
None
- fill(value)
Fill the array with a scalar value.
- Parameters:
value (scalar) – All elements of a will be assigned this value.
Examples
>>> a = np.array([1, 2]) >>> a.fill(0) >>> a array([0, 0]) >>> a = np.empty(2) >>> a.fill(1) >>> a array([1., 1.])
Fill expects a scalar value and always behaves the same as assigning to a single array element. The following is a rare example where this distinction is important:
>>> a = np.array([None, None], dtype=object) >>> a[0] = np.array(3) >>> a array([array(3), None], dtype=object) >>> a.fill(np.array(3)) >>> a array([array(3), array(3)], dtype=object)
Where other forms of assignments will unpack the array being assigned:
>>> a[...] = np.array(3) >>> a array([3, 3], dtype=object)
- flags
Information about the memory layout of the array.
- C_CONTIGUOUS(C)
The data is in a single, C-style contiguous segment.
- F_CONTIGUOUS(F)
The data is in a single, Fortran-style contiguous segment.
- OWNDATA(O)
The array owns the memory it uses or borrows it from another object.
- WRITEABLE(W)
The data area can be written to. Setting this to False locks the data, making it read-only. A view (slice, etc.) inherits WRITEABLE from its base array at creation time, but a view of a writeable array may be subsequently locked while the base array remains writeable. (The opposite is not true, in that a view of a locked array may not be made writeable. However, currently, locking a base object does not lock any views that already reference it, so under that circumstance it is possible to alter the contents of a locked array via a previously created writeable view onto it.) Attempting to change a non-writeable array raises a RuntimeError exception.
- ALIGNED(A)
The data and all elements are aligned appropriately for the hardware.
- WRITEBACKIFCOPY(X)
This array is a copy of some other array. The C-API function PyArray_ResolveWritebackIfCopy must be called before deallocating to the base array will be updated with the contents of this array.
- FNC
F_CONTIGUOUS and not C_CONTIGUOUS.
- FORC
F_CONTIGUOUS or C_CONTIGUOUS (one-segment test).
- BEHAVED(B)
ALIGNED and WRITEABLE.
- CARRAY(CA)
BEHAVED and C_CONTIGUOUS.
- FARRAY(FA)
BEHAVED and F_CONTIGUOUS and not C_CONTIGUOUS.
Notes
The flags object can be accessed dictionary-like (as in
a.flags['WRITEABLE']), or by using lowercased attribute names (as ina.flags.writeable). Short flag names are only supported in dictionary access.Only the WRITEBACKIFCOPY, WRITEABLE, and ALIGNED flags can be changed by the user, via direct assignment to the attribute or dictionary entry, or by calling ndarray.setflags.
The array flags cannot be set arbitrarily:
WRITEBACKIFCOPY can only be set
False.ALIGNED can only be set
Trueif the data is truly aligned.WRITEABLE can only be set
Trueif the array owns its own memory or the ultimate owner of the memory exposes a writeable buffer interface or is a string.
Arrays can be both C-style and Fortran-style contiguous simultaneously. This is clear for 1-dimensional arrays, but can also be true for higher dimensional arrays.
Even for contiguous arrays a stride for a given dimension
arr.strides[dim]may be arbitrary ifarr.shape[dim] == 1or the array has no elements. It does not generally hold thatself.strides[-1] == self.itemsizefor C-style contiguous arrays orself.strides[0] == self.itemsizefor Fortran-style contiguous arrays is true.
- flat
A 1-D iterator over the array.
This is a numpy.flatiter instance, which acts similarly to, but is not a subclass of, Python’s built-in iterator object.
Examples
>>> x = np.arange(1, 7).reshape(2, 3) >>> x array([[1, 2, 3], [4, 5, 6]]) >>> x.flat[3] 4 >>> x.T array([[1, 4], [2, 5], [3, 6]]) >>> x.T.flat[3] 5 >>> type(x.flat) <class 'numpy.flatiter'>
An assignment example:
>>> x.flat = 3; x array([[3, 3, 3], [3, 3, 3]]) >>> x.flat[[1,4]] = 1; x array([[3, 1, 3], [3, 1, 3]])
- flatten(order='C')
Return a copy of the array collapsed into one dimension.
- Parameters:
order ({'C', 'F', 'A', 'K'}, optional) – ‘C’ means to flatten in row-major (C-style) order. ‘F’ means to flatten in column-major (Fortran- style) order. ‘A’ means to flatten in column-major order if a is Fortran contiguous in memory, row-major order otherwise. ‘K’ means to flatten a in the order the elements occur in memory. The default is ‘C’.
- Returns:
y – A copy of the input array, flattened to one dimension.
- Return type:
Examples
>>> a = np.array([[1,2], [3,4]]) >>> a.flatten() array([1, 2, 3, 4]) >>> a.flatten('F') array([1, 3, 2, 4])
- getfield(dtype, offset=0)
Returns a field of the given array as a certain type.
A field is a view of the array data with a given data-type. The values in the view are determined by the given type and the offset into the current array in bytes. The offset needs to be such that the view dtype fits in the array dtype; for example an array of dtype complex128 has 16-byte elements. If taking a view with a 32-bit integer (4 bytes), the offset needs to be between 0 and 12 bytes.
- Parameters:
dtype (str or dtype) – The data type of the view. The dtype size of the view can not be larger than that of the array itself.
offset (int) – Number of bytes to skip before beginning the element view.
Examples
>>> x = np.diag([1.+1.j]*2) >>> x[1, 1] = 2 + 4.j >>> x array([[1.+1.j, 0.+0.j], [0.+0.j, 2.+4.j]]) >>> x.getfield(np.float64) array([[1., 0.], [0., 2.]])
By choosing an offset of 8 bytes we can select the complex part of the array for our view:
>>> x.getfield(np.float64, offset=8) array([[1., 0.], [0., 4.]])
- imag
The imaginary part of the array.
Examples
>>> x = np.sqrt([1+0j, 0+1j]) >>> x.imag array([ 0. , 0.70710678]) >>> x.imag.dtype dtype('float64')
- item(*args)
Copy an element of an array to a standard Python scalar and return it.
- Parameters:
*args (Arguments (variable number and type)) –
none: in this case, the method only works for arrays with one element (a.size == 1), which element is copied into a standard Python scalar object and returned.
int_type: this argument is interpreted as a flat index into the array, specifying which element to copy and return.
tuple of int_types: functions as does a single int_type argument, except that the argument is interpreted as an nd-index into the array.
- Returns:
z – A copy of the specified element of the array as a suitable Python scalar
- Return type:
Standard Python scalar object
Notes
When the data type of a is longdouble or clongdouble, item() returns a scalar array object because there is no available Python scalar that would not lose information. Void arrays return a buffer object for item(), unless fields are defined, in which case a tuple is returned.
item is very similar to a[args], except, instead of an array scalar, a standard Python scalar is returned. This can be useful for speeding up access to elements of the array and doing arithmetic on elements of the array using Python’s optimized math.
Examples
>>> np.random.seed(123) >>> x = np.random.randint(9, size=(3, 3)) >>> x array([[2, 2, 6], [1, 3, 6], [1, 0, 1]]) >>> x.item(3) 1 >>> x.item(7) 0 >>> x.item((0, 1)) 2 >>> x.item((2, 2)) 1
- itemset(*args)
Insert scalar into an array (scalar is cast to array’s dtype, if possible)
There must be at least 1 argument, and define the last argument as item. Then,
a.itemset(*args)is equivalent to but faster thana[args] = item. The item should be a scalar value and args must select a single item in the array a.- Parameters:
*args (Arguments) – If one argument: a scalar, only used in case a is of size 1. If two arguments: the last argument is the value to be set and must be a scalar, the first argument specifies a single array element location. It is either an int or a tuple.
Notes
Compared to indexing syntax, itemset provides some speed increase for placing a scalar into a particular location in an ndarray, if you must do this. However, generally this is discouraged: among other problems, it complicates the appearance of the code. Also, when using itemset (and item) inside a loop, be sure to assign the methods to a local variable to avoid the attribute look-up at each loop iteration.
Examples
>>> np.random.seed(123) >>> x = np.random.randint(9, size=(3, 3)) >>> x array([[2, 2, 6], [1, 3, 6], [1, 0, 1]]) >>> x.itemset(4, 0) >>> x.itemset((2, 2), 9) >>> x array([[2, 2, 6], [1, 0, 6], [1, 0, 9]])
- itemsize
Length of one array element in bytes.
Examples
>>> x = np.array([1,2,3], dtype=np.float64) >>> x.itemsize 8 >>> x = np.array([1,2,3], dtype=np.complex128) >>> x.itemsize 16
- max(axis=None, out=None, keepdims=False, initial=<no value>, where=True)
Return the maximum along a given axis.
Refer to numpy.amax for full documentation.
See also
numpy.amaxequivalent function
- mean(axis=None, dtype=None, out=None, keepdims=False, *, where=True)
Returns the average of the array elements along given axis.
Refer to numpy.mean for full documentation.
See also
numpy.meanequivalent function
- min(axis=None, out=None, keepdims=False, initial=<no value>, where=True)
Return the minimum along a given axis.
Refer to numpy.amin for full documentation.
See also
numpy.aminequivalent function
- nbytes
Total bytes consumed by the elements of the array.
Notes
Does not include memory consumed by non-element attributes of the array object.
See also
sys.getsizeofMemory consumed by the object itself without parents in case view. This does include memory consumed by non-element attributes.
Examples
>>> x = np.zeros((3,5,2), dtype=np.complex128) >>> x.nbytes 480 >>> np.prod(x.shape) * x.itemsize 480
- ndim
Number of array dimensions.
Examples
>>> x = np.array([1, 2, 3]) >>> x.ndim 1 >>> y = np.zeros((2, 3, 4)) >>> y.ndim 3
- newbyteorder(new_order='S', /)
Return the array with the same data viewed with a different byte order.
Equivalent to:
arr.view(arr.dtype.newbytorder(new_order))
Changes are also made in all fields and sub-arrays of the array data type.
- Parameters:
new_order (string, optional) –
Byte order to force; a value from the byte order specifications below. new_order codes can be any of:
’S’ - swap dtype from current to opposite endian
{‘<’, ‘little’} - little endian
{‘>’, ‘big’} - big endian
{‘=’, ‘native’} - native order, equivalent to sys.byteorder
{‘|’, ‘I’} - ignore (no change to byte order)
The default value (‘S’) results in swapping the current byte order.
- Returns:
new_arr – New array object with the dtype reflecting given change to the byte order.
- Return type:
array
- nonzero()
Return the indices of the elements that are non-zero.
Refer to numpy.nonzero for full documentation.
See also
numpy.nonzeroequivalent function
- partition(kth, axis=-1, kind='introselect', order=None)
Rearranges the elements in the array in such a way that the value of the element in kth position is in the position it would be in a sorted array. All elements smaller than the kth element are moved before this element and all equal or greater are moved behind it. The ordering of the elements in the two partitions is undefined.
Added in version 1.8.0.
- Parameters:
kth (int or sequence of ints) –
Element index to partition by. The kth element value will be in its final sorted position and all smaller elements will be moved before it and all equal or greater elements behind it. The order of all elements in the partitions is undefined. If provided with a sequence of kth it will partition all elements indexed by kth of them into their sorted position at once.
Deprecated since version 1.22.0: Passing booleans as index is deprecated.
axis (int, optional) – Axis along which to sort. Default is -1, which means sort along the last axis.
kind ({'introselect'}, optional) – Selection algorithm. Default is ‘introselect’.
order (str or list of str, optional) – When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need to be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.
See also
numpy.partitionReturn a partitioned copy of an array.
argpartitionIndirect partition.
sortFull sort.
Notes
See
np.partitionfor notes on the different algorithms.Examples
>>> a = np.array([3, 4, 2, 1]) >>> a.partition(3) >>> a array([2, 1, 3, 4])
>>> a.partition((1, 3)) >>> a array([1, 2, 3, 4])
- prod(axis=None, dtype=None, out=None, keepdims=False, initial=1, where=True)
Return the product of the array elements over the given axis
Refer to numpy.prod for full documentation.
See also
numpy.prodequivalent function
- ptp(axis=None, out=None, keepdims=False)
Peak to peak (maximum - minimum) value along a given axis.
Refer to numpy.ptp for full documentation.
See also
numpy.ptpequivalent function
- put(indices, values, mode='raise')
Set
a.flat[n] = values[n]for all n in indices.Refer to numpy.put for full documentation.
See also
numpy.putequivalent function
- ravel([order])
Return a flattened array.
Refer to numpy.ravel for full documentation.
See also
numpy.ravelequivalent function
ndarray.flata flat iterator on the array.
- real
The real part of the array.
Examples
>>> x = np.sqrt([1+0j, 0+1j]) >>> x.real array([ 1. , 0.70710678]) >>> x.real.dtype dtype('float64')
See also
numpy.realequivalent function
- repeat(repeats, axis=None)
Repeat elements of an array.
Refer to numpy.repeat for full documentation.
See also
numpy.repeatequivalent function
- reshape(shape, order='C')
Returns an array containing the same data with a new shape.
Refer to numpy.reshape for full documentation.
See also
numpy.reshapeequivalent function
Notes
Unlike the free function numpy.reshape, this method on ndarray allows the elements of the shape parameter to be passed in as separate arguments. For example,
a.reshape(10, 11)is equivalent toa.reshape((10, 11)).
- resize(new_shape, refcheck=True)
Change shape and size of array in-place.
- Parameters:
new_shape (tuple of ints, or n ints) – Shape of resized array.
refcheck (bool, optional) – If False, reference count will not be checked. Default is True.
- Return type:
None
- Raises:
ValueError – If a does not own its own data or references or views to it exist, and the data memory must be changed. PyPy only: will always raise if the data memory must be changed, since there is no reliable way to determine if references or views to it exist.
SystemError – If the order keyword argument is specified. This behaviour is a bug in NumPy.
See also
resizeReturn a new array with the specified shape.
Notes
This reallocates space for the data area if necessary.
Only contiguous arrays (data elements consecutive in memory) can be resized.
The purpose of the reference count check is to make sure you do not use this array as a buffer for another Python object and then reallocate the memory. However, reference counts can increase in other ways so if you are sure that you have not shared the memory for this array with another Python object, then you may safely set refcheck to False.
Examples
Shrinking an array: array is flattened (in the order that the data are stored in memory), resized, and reshaped:
>>> a = np.array([[0, 1], [2, 3]], order='C') >>> a.resize((2, 1)) >>> a array([[0], [1]])
>>> a = np.array([[0, 1], [2, 3]], order='F') >>> a.resize((2, 1)) >>> a array([[0], [2]])
Enlarging an array: as above, but missing entries are filled with zeros:
>>> b = np.array([[0, 1], [2, 3]]) >>> b.resize(2, 3) # new_shape parameter doesn't have to be a tuple >>> b array([[0, 1, 2], [3, 0, 0]])
Referencing an array prevents resizing…
>>> c = a >>> a.resize((1, 1)) Traceback (most recent call last): ... ValueError: cannot resize an array that references or is referenced ...
Unless refcheck is False:
>>> a.resize((1, 1), refcheck=False) >>> a array([[0]]) >>> c array([[0]])
- round(decimals=0, out=None)
Return a with each element rounded to the given number of decimals.
Refer to numpy.around for full documentation.
See also
numpy.aroundequivalent function
- searchsorted(v, side='left', sorter=None)
Find indices where elements of v should be inserted in a to maintain order.
For full documentation, see numpy.searchsorted
See also
numpy.searchsortedequivalent function
- setfield(val, dtype, offset=0)
Put a value into a specified place in a field defined by a data-type.
Place val into a’s field defined by dtype and beginning offset bytes into the field.
- Parameters:
val (object) – Value to be placed in field.
dtype (dtype object) – Data-type of the field in which to place val.
offset (int, optional) – The number of bytes into the field at which to place val.
- Return type:
None
See also
Examples
>>> x = np.eye(3) >>> x.getfield(np.float64) array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]) >>> x.setfield(3, np.int32) >>> x.getfield(np.int32) array([[3, 3, 3], [3, 3, 3], [3, 3, 3]], dtype=int32) >>> x array([[1.0e+000, 1.5e-323, 1.5e-323], [1.5e-323, 1.0e+000, 1.5e-323], [1.5e-323, 1.5e-323, 1.0e+000]]) >>> x.setfield(np.eye(3), np.int32) >>> x array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
- setflags(write=None, align=None, uic=None)
Set array flags WRITEABLE, ALIGNED, WRITEBACKIFCOPY, respectively.
These Boolean-valued flags affect how numpy interprets the memory area used by a (see Notes below). The ALIGNED flag can only be set to True if the data is actually aligned according to the type. The WRITEBACKIFCOPY and flag can never be set to True. The flag WRITEABLE can only be set to True if the array owns its own memory, or the ultimate owner of the memory exposes a writeable buffer interface, or is a string. (The exception for string is made so that unpickling can be done without copying memory.)
- Parameters:
write (bool, optional) – Describes whether or not a can be written to.
align (bool, optional) – Describes whether or not a is aligned properly for its type.
uic (bool, optional) – Describes whether or not a is a copy of another “base” array.
Notes
Array flags provide information about how the memory area used for the array is to be interpreted. There are 7 Boolean flags in use, only four of which can be changed by the user: WRITEBACKIFCOPY, WRITEABLE, and ALIGNED.
WRITEABLE (W) the data area can be written to;
ALIGNED (A) the data and strides are aligned appropriately for the hardware (as determined by the compiler);
WRITEBACKIFCOPY (X) this array is a copy of some other array (referenced by .base). When the C-API function PyArray_ResolveWritebackIfCopy is called, the base array will be updated with the contents of this array.
All flags can be accessed using the single (upper case) letter as well as the full name.
Examples
>>> y = np.array([[3, 1, 7], ... [2, 0, 0], ... [8, 5, 9]]) >>> y array([[3, 1, 7], [2, 0, 0], [8, 5, 9]]) >>> y.flags C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False >>> y.setflags(write=0, align=0) >>> y.flags C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : False ALIGNED : False WRITEBACKIFCOPY : False >>> y.setflags(uic=1) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: cannot set WRITEBACKIFCOPY flag to True
- shape
Tuple of array dimensions.
The shape property is usually used to get the current shape of an array, but may also be used to reshape the array in-place by assigning a tuple of array dimensions to it. As with numpy.reshape, one of the new shape dimensions can be -1, in which case its value is inferred from the size of the array and the remaining dimensions. Reshaping an array in-place will fail if a copy is required.
Warning
Setting
arr.shapeis discouraged and may be deprecated in the future. Using ndarray.reshape is the preferred approach.Examples
>>> x = np.array([1, 2, 3, 4]) >>> x.shape (4,) >>> y = np.zeros((2, 3, 4)) >>> y.shape (2, 3, 4) >>> y.shape = (3, 8) >>> y array([[ 0., 0., 0., 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0., 0., 0., 0.]]) >>> y.shape = (3, 6) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: total size of new array must be unchanged >>> np.zeros((4,2))[::2].shape = (-1,) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: Incompatible shape for in-place modification. Use `.reshape()` to make a copy with the desired shape.
See also
numpy.shapeEquivalent getter function.
numpy.reshapeFunction similar to setting
shape.ndarray.reshapeMethod similar to setting
shape.
- size
Number of elements in the array.
Equal to
np.prod(a.shape), i.e., the product of the array’s dimensions.Notes
a.size returns a standard arbitrary precision Python integer. This may not be the case with other methods of obtaining the same value (like the suggested
np.prod(a.shape), which returns an instance ofnp.int_), and may be relevant if the value is used further in calculations that may overflow a fixed size integer type.Examples
>>> x = np.zeros((3, 5, 2), dtype=np.complex128) >>> x.size 30 >>> np.prod(x.shape) 30
- sort(axis=-1, kind=None, order=None)
Sort an array in-place. Refer to numpy.sort for full documentation.
- Parameters:
axis (int, optional) – Axis along which to sort. Default is -1, which means sort along the last axis.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, optional) –
Sorting algorithm. The default is ‘quicksort’. Note that both ‘stable’ and ‘mergesort’ use timsort under the covers and, in general, the actual implementation will vary with datatype. The ‘mergesort’ option is retained for backwards compatibility.
Changed in version 1.15.0: The ‘stable’ option was added.
order (str or list of str, optional) – When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.
See also
numpy.sortReturn a sorted copy of an array.
numpy.argsortIndirect sort.
numpy.lexsortIndirect stable sort on multiple keys.
numpy.searchsortedFind elements in sorted array.
numpy.partitionPartial sort.
Notes
See numpy.sort for notes on the different sorting algorithms.
Examples
>>> a = np.array([[1,4], [3,1]]) >>> a.sort(axis=1) >>> a array([[1, 4], [1, 3]]) >>> a.sort(axis=0) >>> a array([[1, 3], [1, 4]])
Use the order keyword to specify a field to use when sorting a structured array:
>>> a = np.array([('a', 2), ('c', 1)], dtype=[('x', 'S1'), ('y', int)]) >>> a.sort(order='y') >>> a array([(b'c', 1), (b'a', 2)], dtype=[('x', 'S1'), ('y', '<i8')])
- squeeze(axis=None)
Remove axes of length one from a.
Refer to numpy.squeeze for full documentation.
See also
numpy.squeezeequivalent function
- std(axis=None, dtype=None, out=None, ddof=0, keepdims=False, *, where=True)
Returns the standard deviation of the array elements along given axis.
Refer to numpy.std for full documentation.
See also
numpy.stdequivalent function
- strides
Tuple of bytes to step in each dimension when traversing an array.
The byte offset of element
(i[0], i[1], ..., i[n])in an array a is:offset = sum(np.array(i) * a.strides)
A more detailed explanation of strides can be found in the “ndarray.rst” file in the NumPy reference guide.
Warning
Setting
arr.stridesis discouraged and may be deprecated in the future. numpy.lib.stride_tricks.as_strided should be preferred to create a new view of the same data in a safer way.Notes
Imagine an array of 32-bit integers (each 4 bytes):
x = np.array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]], dtype=np.int32)
This array is stored in memory as 40 bytes, one after the other (known as a contiguous block of memory). The strides of an array tell us how many bytes we have to skip in memory to move to the next position along a certain axis. For example, we have to skip 4 bytes (1 value) to move to the next column, but 20 bytes (5 values) to get to the same position in the next row. As such, the strides for the array x will be
(20, 4).See also
numpy.lib.stride_tricks.as_stridedExamples
>>> y = np.reshape(np.arange(2*3*4), (2,3,4)) >>> y array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]) >>> y.strides (48, 16, 4) >>> y[1,1,1] 17 >>> offset=sum(y.strides * np.array((1,1,1))) >>> offset/y.itemsize 17
>>> x = np.reshape(np.arange(5*6*7*8), (5,6,7,8)).transpose(2,3,1,0) >>> x.strides (32, 4, 224, 1344) >>> i = np.array([3,5,2,2]) >>> offset = sum(i * x.strides) >>> x[3,5,2,2] 813 >>> offset / x.itemsize 813
- sum(axis=None, dtype=None, out=None, keepdims=False, initial=0, where=True)
Return the sum of the array elements over the given axis.
Refer to numpy.sum for full documentation.
See also
numpy.sumequivalent function
- swapaxes(axis1, axis2)
Return a view of the array with axis1 and axis2 interchanged.
Refer to numpy.swapaxes for full documentation.
See also
numpy.swapaxesequivalent function
- take(indices, axis=None, out=None, mode='raise')
Return an array formed from the elements of a at the given indices.
Refer to numpy.take for full documentation.
See also
numpy.takeequivalent function
- tobytes(order='C')
Construct Python bytes containing the raw data bytes in the array.
Constructs Python bytes showing a copy of the raw contents of data memory. The bytes object is produced in C-order by default. This behavior is controlled by the
orderparameter.Added in version 1.9.0.
- Parameters:
order ({'C', 'F', 'A'}, optional) – Controls the memory layout of the bytes object. ‘C’ means C-order, ‘F’ means F-order, ‘A’ (short for Any) means ‘F’ if a is Fortran contiguous, ‘C’ otherwise. Default is ‘C’.
- Returns:
s – Python bytes exhibiting a copy of a’s raw data.
- Return type:
bytes
See also
frombufferInverse of this operation, construct a 1-dimensional array from Python bytes.
Examples
>>> x = np.array([[0, 1], [2, 3]], dtype='<u2') >>> x.tobytes() b'\x00\x00\x01\x00\x02\x00\x03\x00' >>> x.tobytes('C') == x.tobytes() True >>> x.tobytes('F') b'\x00\x00\x02\x00\x01\x00\x03\x00'
- tofile(fid, sep='', format='%s')
Write array to a file as text or binary (default).
Data is always written in ‘C’ order, independent of the order of a. The data produced by this method can be recovered using the function fromfile().
- Parameters:
fid (file or str or Path) –
An open file object, or a string containing a filename.
Changed in version 1.17.0: pathlib.Path objects are now accepted.
sep (str) – Separator between array items for text output. If “” (empty), a binary file is written, equivalent to
file.write(a.tobytes()).format (str) – Format string for text file output. Each entry in the array is formatted to text by first converting it to the closest Python type, and then using “format” % item.
Notes
This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness. Some of these problems can be overcome by outputting the data as text files, at the expense of speed and file size.
When fid is a file object, array contents are directly written to the file, bypassing the file object’s
writemethod. As a result, tofile cannot be used with files objects supporting compression (e.g., GzipFile) or file-like objects that do not supportfileno()(e.g., BytesIO).
- tolist()
Return the array as an
a.ndim-levels deep nested list of Python scalars.Return a copy of the array data as a (nested) Python list. Data items are converted to the nearest compatible builtin Python type, via the ~numpy.ndarray.item function.
If
a.ndimis 0, then since the depth of the nested list is 0, it will not be a list at all, but a simple Python scalar.- Parameters:
none
- Returns:
y – The possibly nested list of array elements.
- Return type:
object, or list of object, or list of list of object, or …
Notes
The array may be recreated via
a = np.array(a.tolist()), although this may sometimes lose precision.Examples
For a 1D array,
a.tolist()is almost the same aslist(a), except thattolistchanges numpy scalars to Python scalars:>>> a = np.uint32([1, 2]) >>> a_list = list(a) >>> a_list [1, 2] >>> type(a_list[0]) <class 'numpy.uint32'> >>> a_tolist = a.tolist() >>> a_tolist [1, 2] >>> type(a_tolist[0]) <class 'int'>
Additionally, for a 2D array,
tolistapplies recursively:>>> a = np.array([[1, 2], [3, 4]]) >>> list(a) [array([1, 2]), array([3, 4])] >>> a.tolist() [[1, 2], [3, 4]]
The base case for this recursion is a 0D array:
>>> a = np.array(1) >>> list(a) Traceback (most recent call last): ... TypeError: iteration over a 0-d array >>> a.tolist() 1
- tostring(order='C')
A compatibility alias for tobytes, with exactly the same behavior.
Despite its name, it returns bytes not strs.
Deprecated since version 1.19.0.
- trace(offset=0, axis1=0, axis2=1, dtype=None, out=None)
Return the sum along diagonals of the array.
Refer to numpy.trace for full documentation.
See also
numpy.traceequivalent function
- transpose(*axes)
Returns a view of the array with axes transposed.
Refer to numpy.transpose for full documentation.
- Parameters:
axes (None, tuple of ints, or n ints) –
None or no argument: reverses the order of the axes.
tuple of ints: i in the j-th place in the tuple means that the array’s i-th axis becomes the transposed array’s j-th axis.
n ints: same as an n-tuple of the same ints (this form is intended simply as a “convenience” alternative to the tuple form).
- Returns:
p – View of the array with its axes suitably permuted.
- Return type:
See also
transposeEquivalent function.
ndarray.TArray property returning the array transposed.
ndarray.reshapeGive a new shape to an array without changing its data.
Examples
>>> a = np.array([[1, 2], [3, 4]]) >>> a array([[1, 2], [3, 4]]) >>> a.transpose() array([[1, 3], [2, 4]]) >>> a.transpose((1, 0)) array([[1, 3], [2, 4]]) >>> a.transpose(1, 0) array([[1, 3], [2, 4]])
>>> a = np.array([1, 2, 3, 4]) >>> a array([1, 2, 3, 4]) >>> a.transpose() array([1, 2, 3, 4])
- var(axis=None, dtype=None, out=None, ddof=0, keepdims=False, *, where=True)
Returns the variance of the array elements, along given axis.
Refer to numpy.var for full documentation.
See also
numpy.varequivalent function
- view([dtype][, type])
New view of array with the same data.
Note
Passing None for
dtypeis different from omitting the parameter, since the former invokesdtype(None)which is an alias fordtype('float_').- Parameters:
dtype (data-type or ndarray sub-class, optional) – Data-type descriptor of the returned view, e.g., float32 or int16. Omitting it results in the view having the same data-type as a. This argument can also be specified as an ndarray sub-class, which then specifies the type of the returned object (this is equivalent to setting the
typeparameter).type (Python type, optional) – Type of the returned view, e.g., ndarray or matrix. Again, omission of the parameter results in type preservation.
Notes
a.view()is used two different ways:a.view(some_dtype)ora.view(dtype=some_dtype)constructs a view of the array’s memory with a different data-type. This can cause a reinterpretation of the bytes of memory.a.view(ndarray_subclass)ora.view(type=ndarray_subclass)just returns an instance of ndarray_subclass that looks at the same array (same shape, dtype, etc.) This does not cause a reinterpretation of the memory.For
a.view(some_dtype), ifsome_dtypehas a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the last axis ofamust be contiguous. This axis will be resized in the result.Changed in version 1.23.0: Only the last axis needs to be contiguous. Previously, the entire array had to be C-contiguous.
Examples
>>> x = np.array([(1, 2)], dtype=[('a', np.int8), ('b', np.int8)])
Viewing array data using a different type and dtype:
>>> y = x.view(dtype=np.int16, type=np.matrix) >>> y matrix([[513]], dtype=int16) >>> print(type(y)) <class 'numpy.matrix'>
Creating a view on a structured array so it can be used in calculations
>>> x = np.array([(1, 2),(3,4)], dtype=[('a', np.int8), ('b', np.int8)]) >>> xv = x.view(dtype=np.int8).reshape(-1,2) >>> xv array([[1, 2], [3, 4]], dtype=int8) >>> xv.mean(0) array([2., 3.])
Making changes to the view changes the underlying array
>>> xv[0,1] = 20 >>> x array([(1, 20), (3, 4)], dtype=[('a', 'i1'), ('b', 'i1')])
Using a view to convert an array to a recarray:
>>> z = x.view(np.recarray) >>> z.a array([1, 3], dtype=int8)
Views share data:
>>> x[0] = (9, 10) >>> z[0] (9, 10)
Views that change the dtype size (bytes per entry) should normally be avoided on arrays defined by slices, transposes, fortran-ordering, etc.:
>>> x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int16) >>> y = x[:, ::2] >>> y array([[1, 3], [4, 6]], dtype=int16) >>> y.view(dtype=[('width', np.int16), ('length', np.int16)]) Traceback (most recent call last): ... ValueError: To change to a dtype of a different size, the last axis must be contiguous >>> z = y.copy() >>> z.view(dtype=[('width', np.int16), ('length', np.int16)]) array([[(1, 3)], [(4, 6)]], dtype=[('width', '<i2'), ('length', '<i2')])
However, views that change dtype are totally fine for arrays with a contiguous last axis, even if the rest of the axes are not C-contiguous:
>>> x = np.arange(2 * 3 * 4, dtype=np.int8).reshape(2, 3, 4) >>> x.transpose(1, 0, 2).view(np.int16) array([[[ 256, 770], [3340, 3854]], [[1284, 1798], [4368, 4882]], [[2312, 2826], [5396, 5910]]], dtype=int16)
- pyranges1.ext.stats.relative_distance(p: PyRanges, other: PyRanges, **_) DataFrame
Compute spatial correlation between two sets.
Metric which describes relative distance between each interval in one set and two closest intervals in another.
- Parameters:
p (PyRanges) – Intervals to compare.
other (PyRanges) – Intervals to compare with.
chromsizes (int, dict, DataFrame or PyRanges) – Integer representing genome length or mapping from chromosomes to its length.
strandedness ({None, "same", "opposite", False}, default None, i.e. "auto") – Whether to compute without regards to strand or on same or opposite.
- Returns:
DataFrame containing the frequency of each relative distance.
- Return type:
pandas.DataFrame
See also
pyranges.stats.jaccardcompute the jaccard coefficient
pyranges.stats.forbescompute the forbes coefficient
Examples
>>> gr1, gr2 = pr.example_data.chipseq, pr.example_data.chipseq_background >>> gr = pd.concat([gr1, gr1.head(4), gr2.tail(4)]) >>> chromsizes = pr.example_data.chromsizes >>> pr.stats.relative_distance(gr, gr2) reldist count total fraction 0 0.00 4 18 0.222222 1 0.03 1 18 0.055556 2 0.04 1 18 0.055556 3 0.10 1 18 0.055556 4 0.12 1 18 0.055556 5 0.13 1 18 0.055556 6 0.19 1 18 0.055556 7 0.23 1 18 0.055556 8 0.24 1 18 0.055556 9 0.38 1 18 0.055556 10 0.41 2 18 0.111111 11 0.42 1 18 0.055556 12 0.43 2 18 0.111111
- pyranges1.ext.stats.rowbased_pearson(x: ndarray | DataFrame, y: ndarray | DataFrame) ndarray
Fast row-based Pearson’s correlation.
- Parameters:
x (matrix-like) – 2D numerical matrix. Same size as y.
y (matrix-like) – 2D numerical matrix. Same size as x.
- Returns:
Array with same length as input, where values are P-values.
- Return type:
numpy.array
See also
pyranges.stats.rowbased_spearmanfast row-based Spearman’s correlation.
pyranges.stats.fdrcorrect for multiple testing
Examples
>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]]) >>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])
Perform Pearson’s correlation pairwise on each row in 10x10 matrixes:
>>> pr.stats.rowbased_pearson(x, y) array([-0.09078413, 0.65465367, -1. ])
- pyranges1.ext.stats.rowbased_rankdata(data: ndarray) DataFrame
Rank order of entries in each row.
Same as SciPy rankdata with method=mean.
- Parameters:
data (matrix-like) – The data to find order of.
- Returns:
DataFrame where values are order of data.
- Return type:
Pandas.DataFrame
Examples
>>> x = np.random.randint(10, size=(3, 4)) >>> x = np.array([[3, 7, 6, 0], [1, 3, 8, 9], [5, 9, 3, 5]]) >>> pr.stats.rowbased_rankdata(x) 0 1 2 3 0 2.0 4.0 3.0 1.0 1 1.0 2.0 3.0 4.0 2 2.5 4.0 1.0 2.5
- pyranges1.ext.stats.rowbased_spearman(x: ndarray, y: ndarray) ndarray
Fast row-based Spearman’s correlation.
- Parameters:
x (matrix-like) – 2D numerical matrix. Same size as y.
y (matrix-like) – 2D numerical matrix. Same size as x.
- Returns:
Array with same length as input, where values are P-values.
- Return type:
numpy.array
See also
pyranges.stats.rowbased_pearsonfast row-based Pearson’s correlation.
pyranges.stats.fdrcorrect for multiple testing
Examples
>>> x = np.array([[7, 2, 9], [3, 6, 0], [0, 6, 3]]) >>> y = np.array([[5, 3, 2], [9, 6, 0], [7, 3, 5]])
Perform Spearman’s correlation pairwise on each row in 10x10 matrixes:
>>> pr.stats.rowbased_spearman(x, y) array([-0.5, 0.5, -1. ])
- pyranges1.ext.stats.simes(df: PyRanges, by: str | list[str], pcol: str, *, keep_position: bool = False) PyRanges | DataFrame
Apply Simes method for giving dependent events a p-value.
- Parameters:
df (pandas.DataFrame) – Data to analyse with Simes.
by (str or list of str) – Features equal in these columns will be merged with Simes.
pcol (str) – Name of column with p-values.
keep_position (bool, default False) – Keep columns “Chromosome”, “Start”, “End” and “Strand” if they exist.
See also
pr.stats.fdrcorrect for multiple testing
Examples
>>> s = '''Chromosome Start End Strand Gene PValue ... 1 10 20 + P53 0.0001 ... 1 20 35 + P53 0.0002 ... 1 30 40 + P53 0.0003 ... 2 60 65 - FOX 0.05 ... 2 70 75 - FOX 0.0000001 ... 2 80 90 - FOX 0.0000021'''
>>> gr = pr.from_string(s) >>> gr index | Chromosome Start End Strand Gene PValue int64 | int64 int64 int64 str str float64 ------- --- ------------ ------- ------- -------- ------ --------- 0 | 1 10 20 + P53 0.0001 1 | 1 20 35 + P53 0.0002 2 | 1 30 40 + P53 0.0003 3 | 2 60 65 - FOX 0.05 4 | 2 70 75 - FOX 1e-07 5 | 2 80 90 - FOX 2.1e-06 PyRanges with 6 rows, 6 columns, and 1 index columns. Contains 2 chromosomes and 2 strands.
>>> simes = pr.stats.simes(gr, "Gene", "PValue") >>> simes Gene Simes 0 FOX 3.000000e-07 1 P53 3.000000e-04
>>> pr.stats.simes(gr, "Gene", "PValue", keep_position=True) index | Chromosome Start End Simes Strand Gene int64 | int64 int64 int64 float64 str str ------- --- ------------ ------- ------- --------- -------- ------ 0 | 2 60 90 1e-07 - FOX 1 | 1 10 40 0.0001 + P53 PyRanges with 2 rows, 6 columns, and 1 index columns. Contains 2 chromosomes and 2 strands.
- pyranges1.ext.stats.sqrt(x, /)
Return the square root of x.
- pyranges1.ext.stats.strand_behavior_from_validated_use_strand(df: PyRanges, use_strand: Literal['auto'] | bool) Literal['auto', 'same', 'opposite', 'ignore']
Return strand_behavior based on use_strand bool.
If strand is True, returns ‘same’ strand behavior, otherwise ‘ignore’ strand behavior.
This function must be used with use_strand validated and converted to a boolean value.
- pyranges1.ext.stats.use_strand_from_validated_strand_behavior(self: PyRanges, other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore']) bool
Return use_strand based on strand_behavior.
If strand_behavior is ‘ignore’, returns False, otherwise True if both PyRanges contain valid strand info.
This function must be used with strand_behavior validated and converted to either ‘same’, ‘opposite’ or ‘ignore’.
- pyranges1.ext.stats.validate_and_convert_strand_behavior(self: PyRanges, other: PyRanges, strand_behavior: Literal['auto', 'same', 'opposite', 'ignore']) Literal['same', 'opposite', 'ignore']
Validate and convert strand behavior.
Used inside various methods accepting two PyRanges as input. It processes strand_behavior option provided by the user upon calling the function. This function validates the option and converts it to a valid unambiguous value.