pandas Groupby: split-apply-combine#

When it comes to running groupby operations, pandas offers many options. Namely, you can choose to use one of these three methods:

  1. agg or aggregate

  2. transform

  3. apply

This blog post takes the guesswork out of whether you should agg, transform, or apply your groupby operations on pandas objects.

Reducing vs Non-Reducing Functions#

To better grasp the differences between the pandas groupby .agg, .transform, and .apply methods, we first need to differentiate reducing functions from non-reducing functions.

from IPython.display import Markdown, display 
from pandas import Series, DataFrame, Index

from numpy.random import default_rng

x = Series([1, 2, 3, 4, 5])

x
0    1
1    2
2    3
3    4
4    5
dtype: int64

Reducing functions N → 1

  • Reduces the size of the input array to a scalar output.

    • sum([1, 2, 3]) → 6 is an example of a reduction. The input [1, 2, 3] is being reduced to a single value 6 by some operation sum.

  • These are commonly aggregation functions (e.g. sum, mean, median)

    • Can also be item selections (e.g., first, nth, last)

x.sum()
15

Non-Reducing functions N → N

  • Maintains the original size of the input array while transforming the original values

    • log10([10, 100, 1000]) → [1, 2, 3] is an example of a non-reducing transformation because the output has just as many values as the input.

  • Functions operate “element-wise” (on each element) in an array and return a new value

    • e.g., log, log10, pow, cumsum, add, subtract

x.pow(2)
0     1
1     4
2     9
3    16
4    25
dtype: int64

Groupby Operations#

In pandas, groupby operations work on a “split → apply → combine” basis: the data is first split into groups, some function is then applied to each group, and those resultant pieces are recombined into the final DataFrame.

But if it’s as simple as the operation described above, then why does pandas have 3 options when performing a .groupby operation: .agg, .transform, and .apply?

The reason lies in the speed of the operation. By making certain assumptions about the output shape of the combine step (e.g., N → N or N → 1), pandas can take some shortcuts to ensure that our operations are applied as quickly as possible. As you may have guessed, the resultant shape is very closely tied to whether the function we want to apply is a reducing or a non-reducing function!

Let’s make some data to demonstrate:

rng = default_rng(0)

df = DataFrame(
    index=(idx := Index(["a", "b", "c"], name='Group').repeat(3)),
    data={
        'values1': [34, 25, 20, 10, 12, 1, 3, 0, 7],
        'values2': [39, 41, 15, 20, 25, 31, 10, 9, 4],
    }
)

df
       values1  values2
Group
a           34       39
a           25       41
a           20       15
b           10       20
b           12       25
b            1       31
c            3       10
c            0        9
c            7        4
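
To make the “split → apply → combine” mechanics concrete, here is a minimal sketch (my illustration, not how pandas implements it internally) that performs each step by hand and arrives at the same result as df.groupby('Group').sum():

# split: iterate over the groups; apply: reduce each sub-frame with .sum()
pieces = {name: group.sum() for name, group in df.groupby('Group')}

# combine: stack the per-group results back into a single DataFrame
combined = DataFrame(pieces).T.rename_axis('Group')

combined  # equivalent to df.groupby('Group').sum()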

DataFrame.groupby(…).agg#

  • .agg only works with functions that reduce.

  • For user-defined functions, .agg operates on each column independently; this means that you cannot operate across columns with .agg

When working with multiple columns, as our data set contains, pandas will apply the function to each column within each group separately. This means you cannot use values from multiple columns inside the function passed to .agg.

There are many ways to pass functions into .agg. One is to pass multiple reduction functions as a list. Note that I receive an output for each of my original columns. To read about other ways to pass functions into .agg, check out the pandas.core.groupby.DataFrameGroupBy.aggregate documentation.

df.groupby('Group').agg([sum, 'mean', lambda s: False])
/tmp/ipykernel_28498/366122397.py:1: FutureWarning: The provided callable <built-in function sum> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.groupby('Group').agg([sum, 'mean', lambda s: False])
/tmp/ipykernel_28498/366122397.py:1: FutureWarning: The provided callable <built-in function sum> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.groupby('Group').agg([sum, 'mean', lambda s: False])
        values1                     values2
            sum       mean  <lambda_0>  sum       mean  <lambda_0>
Group
a            79  26.333333       False   95  31.666667       False
b            23   7.666667       False   76  25.333333       False
c            10   3.333333       False   23   7.666667       False
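
For reference, .agg also accepts a dict that maps columns to functions, as well as keyword-based “named aggregation” that lets you name the output columns directly. These forms are standard pandas usage rather than part of the example above; the output column names below are my own.

# a dict gives each column its own reduction(s)
df.groupby('Group').agg({'values1': 'sum', 'values2': ['mean', 'max']})

# named aggregation: keyword → (column, reducer)
df.groupby('Group').agg(
    values1_total=('values1', 'sum'),
    values2_mean=('values2', 'mean'),
)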

If we pass a user defined function that does not reduce, we see this error:

df.groupby('Group')['values1'].agg(lambda s: s * 2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df.groupby('Group')['values1'].agg(lambda s: s * 2)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/generic.py:294, in SeriesGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    291     return self._python_agg_general(func, *args, **kwargs)
    293 try:
--> 294     return self._python_agg_general(func, *args, **kwargs)
    295 except KeyError:
    296     # KeyError raised in test_groupby.test_basic is bc the func does
    297     #  a dictionary lookup on group.name, but group name is not
    298     #  pinned in _python_agg_general, only in _aggregate_named
    299     result = self._aggregate_named(func, *args, **kwargs)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/generic.py:327, in SeriesGroupBy._python_agg_general(self, func, *args, **kwargs)
    324 f = lambda x: func(x, *args, **kwargs)
    326 obj = self._obj_with_exclusions
--> 327 result = self._grouper.agg_series(obj, f)
    328 res = obj._constructor(result, name=obj.name)
    329 return self._wrap_aggregated_output(res)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:864, in BaseGrouper.agg_series(self, obj, func, preserve_dtype)
    857 if not isinstance(obj._values, np.ndarray):
    858     # we can preserve a little bit more aggressively with EA dtype
    859     #  because maybe_cast_pointwise_result will do a try/except
    860     #  with _from_sequence.  NB we are assuming here that _from_sequence
    861     #  is sufficiently strict that it casts appropriately.
    862     preserve_dtype = True
--> 864 result = self._aggregate_series_pure_python(obj, func)
    866 npvalues = lib.maybe_convert_objects(result, try_float=False)
    867 if preserve_dtype:

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:890, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
    886 res = extract_result(res)
    888 if not initialized:
    889     # We only do this validation on the first iteration
--> 890     check_result_array(res, group.dtype)
    891     initialized = True
    893 result[i] = res

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:88, in check_result_array(obj, dtype)
     84 if isinstance(obj, np.ndarray):
     85     if dtype != object:
     86         # If it is object dtype, the function can be a reduction/aggregation
     87         #  and still return an ndarray e.g. test_agg_over_numpy_arrays
---> 88         raise ValueError("Must produce aggregated value")

ValueError: Must produce aggregated value

DataFrame.groupby(…).transform#

.transform works with both non-reducing and reducing functions.

A groupby-transform operation ensures that the DataFrame does not change shape after the operation. The original Index remains intact and is used to index the resultant DataFrame.

Non-reducing function

df.groupby('Group').transform(lambda s: s.cumsum())
       values1  values2
Group
a           34       39
a           59       80
a           79       95
b           10       20
b           22       45
b           23       76
c            3       10
c            3       19
c           10       23

Reducing function

However, when a reducing function is used, the output value is broadcast back to the shape of the original input to preserve the N → N grouped operation.

df.groupby('Group').transform(sum)
/tmp/ipykernel_28498/2923381509.py:1: FutureWarning: The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.groupby('Group').transform(sum)
       values1  values2
Group
a           79       95
a           79       95
a           79       95
b           23       76
b           23       76
b           23       76
c           10       23
c           10       23
c           10       23
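
One common use of this broadcast-back behaviour (my example, not part of the original demonstration) is group-wise normalization: since the group totals come back aligned with the original index, plain element-wise arithmetic just works.

# each row expressed as a fraction of its group's total
df / df.groupby('Group').transform('sum')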

DataFrame.groupby(…).apply#

groupby(…).apply is the catch-all for pandas groupby operations. For user-defined functions, .apply works as follows:

  • N → M transformations (including N → 1 and N → N)

  • It works with entire sub-frames (all columns of each group at once), instead of on a per-column, per-group basis like .agg and .transform

The actual mechanics of .apply are similar to the above, except that pandas cannot make the same assumptions as it can with the .agg and .transform methods. This enables maximum flexibility for passing user-defined functions to a groupby operation, at the cost of some speed.
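
As a quick illustration (my example, assuming the df defined above), here is a cross-column, N → M operation that neither .agg nor .transform can express: keeping the two rows of each group with the largest values1 * values2 product.

df.groupby('Group').apply(
    lambda g: g.assign(product=g['values1'] * g['values2'])  # uses multiple columns at once
               .nlargest(2, 'product')                       # N rows in, 2 rows out per group
)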

Wrap Up#

That takes us to the end of this post! This should help orient you towards how you can better leverage .agg and .transform in your work, and only use .apply when you really need it.

In summary, you can think of groupby operations with the following table:

Method        Accepts                       Output shape   Operates on
.agg          reducing functions only       N → 1          each column, per group
.transform    reducing and non-reducing     N → N          each column, per group
.apply        any user-defined function     N → M          the entire sub-frame, per group

Talk to you all next time!