pandas Groupby: split-?-combine

pandas offers several ways to run an operation on a groupby object. The three main ones are:

  1. agg or aggregate

  2. transform

  3. apply

This blog post takes the guesswork out of whether you should agg, transform, or apply your groupby operations on pandas objects.

Reducing vs Non-Reducing Functions

To better grasp the differences between the pandas groupby methods agg, transform, and apply, we first need to differentiate reducing functions from non-reducing functions.

from IPython.display import Markdown, display 
from pandas import Series, DataFrame, Index

from numpy.random import default_rng

x = Series([1, 2, 3, 4, 5])

x
0    1
1    2
2    3
3    4
4    5
dtype: int64

Reducing functions N → 1

  • Reduces the size of the input array to a scalar output.

    • sum([1, 2, 3]) → 6 is an example of a reduction. The input [1, 2, 3] is reduced to the single value 6 by the operation sum.

  • These are commonly aggregation functions (e.g. sum, mean, median)

    • Can also be item selections (e.g., first, nth, last)

x.sum()
15

Non-Reducing functions N → N

  • Maintains the original size of the input array while transforming the original values

    • log10([10, 100, 1000]) → [1, 2, 3] is an example of a non-reducing transformation because the output has just as many values as the input

  • Functions typically operate “element-wise” (on each element) in an array and return a new value for each one

    • e.g., log, log10, pow, add, subtract; cumsum also preserves length, although each of its outputs depends on the preceding elements

x.pow(2)
0     1
1     4
2     9
3    16
4    25
dtype: int64
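
The distinction above can be checked mechanically by looking at what a function returns. The following is a minimal sketch, not part of pandas: the helper name classify is hypothetical, and it assumes the function can be called directly on a Series.

```python
from pandas import Series
from pandas.api.types import is_scalar


def classify(func, data):
    """Hypothetical helper: label func by the shape of its output on data."""
    result = func(data)
    if is_scalar(result):
        return "reducing"        # N → 1
    if len(result) == len(data):
        return "non-reducing"    # N → N
    return "other"               # N → M, neither of the two cases above


x = Series([1, 2, 3, 4, 5])

labels = {
    "sum": classify(Series.sum, x),
    "cumsum": classify(Series.cumsum, x),
    "head(2)": classify(lambda s: s.head(2), x),
}
print(labels)
```

The third case, where the output is neither a scalar nor the same length as the input, is the kind of function that only .apply is designed to handle.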

Groupby Operations

In pandas, groupby operations work on a “split → apply → combine” basis: the data is first split into groups, some function is applied to each group, and the resulting pieces are recombined into the final DataFrame.
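
To make the three steps concrete, here is a rough sketch that performs them by hand on a small, made-up DataFrame and compares the result with the built-in groupby sum. The data and variable names are illustrative assumptions, and this is not how pandas implements the operation internally.

```python
from pandas import DataFrame, Series

# Small illustrative data set (not the one used later in this post)
df_demo = DataFrame({
    "Group": ["a", "a", "b", "b"],
    "value": [1, 2, 10, 20],
})

# split: iterating over a GroupBy yields (key, sub-frame) pairs
pieces = {key: sub["value"] for key, sub in df_demo.groupby("Group")}

# apply: run a reducing function on each piece
applied = {key: piece.sum() for key, piece in pieces.items()}

# combine: stitch the per-group results back into a single object
manual = Series(applied, name="value").rename_axis("Group")

# the same computation using the built-in machinery
builtin = df_demo.groupby("Group")["value"].sum()

print(manual)
print(builtin)
```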

But if it’s as simple as the operation described above, why does pandas offer three options for performing a .groupby operation: .agg, .transform, and .apply?

The reason is speed. By making certain assumptions about the output shape of the combination step (e.g., N → N or N → 1), pandas can take some shortcuts to ensure that our operations are applied as quickly as possible. As you may have guessed, the resultant shape is very closely tied to whether the function we want to apply is a reducing or a non-reducing function!

Let’s make some data to demonstrate:

rng = default_rng(0)

df = DataFrame(
    index=(idx := Index(["a", "b", "c"], name='Group').repeat(3)),
    data={
        'values1': [34, 25, 20, 10, 12, 1, 3, 0, 7],
        'values2': [39, 41, 15, 20, 25, 31, 10, 9, 4],
    }
)

df
       values1  values2
Group
a           34       39
a           25       41
a           20       15
b           10       20
b           12       25
b            1       31
c            3       10
c            0        9
c            7        4

DataFrame.groupby(…).agg

  • .agg only works with functions that reduce.

  • For user-defined functions, .agg operates on each column independently. This means that you cannot operate across columns with .agg.

When working with multiple columns, as our data set contains, pandas applies the function to each column within each group separately. As a result, the function you pass to .groupby(…).agg() cannot combine values from more than one column.
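
One way to see this per-column behavior is to record what a user-defined function receives. The sketch below uses a hypothetical function named spy_max and a small made-up frame. How often the function gets called is an implementation detail that may differ between pandas versions, so only the type of the argument is inspected.

```python
from pandas import DataFrame, Index

df_demo = DataFrame(
    index=Index(["a", "a", "b", "b"], name="Group"),
    data={"values1": [34, 25, 10, 12], "values2": [39, 41, 20, 25]},
)

seen_types = set()


def spy_max(s):
    # Record the type of object .agg hands to us, then reduce as usual.
    seen_types.add(type(s).__name__)
    return s.max()


result = df_demo.groupby("Group").agg(spy_max)

print(seen_types)  # the function is handed one column at a time
print(result)
```

Because each call only sees a single column, there is no way for spy_max to compare values1 against values2 inside .agg.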

There are many ways to pass functions into .agg. One is to pass multiple reduction functions as a list. Note that I receive an output for each of my original columns. To read about other ways of passing functions to .agg, check out the pandas.core.groupby.DataFrameGroupBy.aggregate documentation.

df.groupby('Group').agg([sum, 'mean', lambda s: False])
      values1                       values2
          sum       mean <lambda_0>     sum       mean <lambda_0>
Group
a          79  26.333333      False      95  31.666667      False
b          23   7.666667      False      76  25.333333      False
c          10   3.333333      False      23   7.666667      False
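
As one example of another way to pass functions to .agg, pandas supports "named aggregation", where each keyword argument names an output column and maps to a (column, function) pair. The sketch below rebuilds the same data; the output column names total1 and mean2 are arbitrary choices for illustration.

```python
from pandas import DataFrame, Index

df_demo = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

# Each keyword is an output column; each value is (input column, reduction)
summary = df_demo.groupby("Group").agg(
    total1=("values1", "sum"),
    mean2=("values2", "mean"),
)

print(summary)
```

This form gives a flat set of output columns instead of the two-level column header produced by passing a list.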

If we pass a user defined function that does not reduce, we see this error:

df.groupby('Group')['values1'].agg(lambda s: s * 2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df.groupby('Group')['values1'].agg(lambda s: s * 2)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/generic.py:287, in SeriesGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    284     return self._python_agg_general(func, *args, **kwargs)
    286 try:
--> 287     return self._python_agg_general(func, *args, **kwargs)
    288 except KeyError:
    289     # TODO: KeyError is raised in _python_agg_general,
    290     #  see test_groupby.test_basic
    291     result = self._aggregate_named(func, *args, **kwargs)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1481, in GroupBy._python_agg_general(self, func, *args, **kwargs)
   1477 name = obj.name
   1479 try:
   1480     # if this function is invalid for this dtype, we will ignore it.
-> 1481     result = self.grouper.agg_series(obj, f)
   1482 except TypeError:
   1483     warn_dropping_nuisance_columns_deprecated(type(self), "agg")

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:981, in BaseGrouper.agg_series(self, obj, func, preserve_dtype)
    978     preserve_dtype = True
    980 else:
--> 981     result = self._aggregate_series_pure_python(obj, func)
    983 npvalues = lib.maybe_convert_objects(result, try_float=False)
    984 if preserve_dtype:

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:1010, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
   1006 res = libreduction.extract_result(res)
   1008 if not initialized:
   1009     # We only do this validation on the first iteration
-> 1010     libreduction.check_result_array(res, group.dtype)
   1011     initialized = True
   1013 counts[i] = group.shape[0]

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/_libs/reduction.pyx:13, in pandas._libs.reduction.check_result_array()

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/_libs/reduction.pyx:21, in pandas._libs.reduction.check_result_array()

ValueError: Must produce aggregated value

DataFrame.groupby(…).transform

.transform works with both non-reducing and reducing functions.

A groupby-transform operation ensures that the DataFrame does not change shape after the operation. The original Index remains intact and is used to index the resultant DataFrame.

Non-reducing function

df.groupby('Group').transform(lambda s: s.cumsum())
       values1  values2
Group
a           34       39
a           59       80
a           79       95
b           10       20
b           22       45
b           23       76
c            3       10
c            3       19
c           10       23

Reducing function

However, when a reducing function is used, the single value produced for each group is broadcast back to every row of that group, ensuring an N → N grouped operation.

df.groupby('Group').transform(sum)
       values1  values2
Group
a           79       95
a           79       95
a           79       95
b           23       76
b           23       76
b           23       76
c           10       23
c           10       23
c           10       23
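
This broadcasting is what makes a reducing function useful inside .transform: the result lines up row for row with the original data, so the two can be combined directly. As an illustrative sketch (the names group_total and share are my own), here is each row's share of its group total:

```python
from pandas import DataFrame, Index

df_demo = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={"values1": [34, 25, 20, 10, 12, 1, 3, 0, 7]},
)

# Reducing function inside transform: one total per group, repeated per row
group_total = df_demo.groupby("Group")["values1"].transform("sum")

# Same length and index as the original column, so division is row-wise
share = df_demo["values1"] / group_total

print(share)
```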

DataFrame.groupby(…).apply

groupby(…).apply is the catch-all for pandas groupby operations. For user-defined functions, .apply works as follows:

  • N → M transformations (including N → 1 and N → N)

  • It works with entire sub-frames, instead of on a per column per group basis like .agg and .transform

The actual mechanics of .apply are similar to the above, except that pandas cannot make the same assumptions about the output shape as it can with the .agg and .transform methods. This enables maximum flexibility for passing user-defined functions to a groupby operation, at some cost in performance.
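
Two cases where .apply is a reasonable fit are sketched below, using the same data as earlier in the post: a reduction that needs two columns at once, and an N → M operation that keeps a subset of rows per group. The variable names ratio and top2 are illustrative only.

```python
from pandas import DataFrame, Index

df_demo = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

# N → 1 across columns: the function receives the whole sub-frame,
# so it can combine values1 and values2
ratio = df_demo.groupby("Group").apply(
    lambda g: g["values1"].sum() / g["values2"].sum()
)

# N → M: keep the two rows with the largest values1 in each group
top2 = df_demo.groupby("Group", group_keys=False).apply(
    lambda g: g.nlargest(2, "values1")
)

print(ratio)
print(top2)
```

Neither of these could be written with .agg or .transform alone, since the first needs both columns and the second returns neither one row nor all rows per group.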

Wrap Up

That takes us to the end of this post! This should help orient you towards how you can better leverage .agg and .transform in your work, and only use .apply when you really need it.

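In summary, you can think of groupby operations with the following table:

Method       Function type              Output shape   User-defined function receives
.agg         reducing                   N → 1          one column per group
.transform   reducing or non-reducing   N → N          one column per group
.apply       any                        N → M          the entire sub-frame per group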

Talk to you all next time!