pandas Groupby: split-?-combine

When you need to run a groupby operation, pandas offers several options. Namely, you can choose one of these three:

1. `agg` or `aggregate`

2. `transform`

3. `apply`

This blog post takes the guesswork out of whether you should `agg`, `transform`, or `apply` your groupby operations on pandas objects.

Reducing vs Non-Reducing Functions

To better grasp the differences between `agg`, `transform`, and `apply` on a pandas `groupby`, we first need to differentiate reducing functions from non-reducing functions.

```python
from IPython.display import Markdown, display
from pandas import Series, DataFrame, Index

from numpy.random import default_rng

x = Series([1, 2, 3, 4, 5])

x
```
```
0    1
1    2
2    3
3    4
4    5
dtype: int64
```

Reducing functions N → 1

• Reduces the size of the input array to a scalar output.

• `sum([1, 2, 3])` → `6` is an example of a reduction. The input `[1, 2, 3]` is being reduced to a single value `6` by some operation `sum`.

• These are commonly aggregation functions (e.g. sum, mean, median)

• Can also be item selections (e.g., `first`, `nth`, `last`)

```python
x.sum()
```
```
15
```

Non-Reducing functions N → N

• Maintains the original size of the input array while transforming the original values

• `log10([10, 100, 1000])` → `[1, 2, 3]` is an example of a non-reducing transformation because the output has just as many values as the input

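• Many of these functions operate “element-wise” (on each element) in an array and return a new value; others, such as `cumsum`, also draw on the preceding elements but still return one output per input

• e.g., `log`, `log10`, `pow`, `cumsum`, `add`, `subtract`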

```python
x.pow(2)
```
```
0     1
1     4
2     9
3    16
4    25
dtype: int64
```
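One informal way to tell the two kinds of function apart is to compare the shape of the output with the shape of the input. The sketch below does this with a hypothetical helper, `describe_shape`, which is not part of pandas and exists only for illustration:

```python
import numpy as np
from pandas import Series

x = Series([1, 2, 3, 4, 5])

def describe_shape(func, data):
    """Hypothetical helper (not a pandas API): classify func by its output shape."""
    result = func(data)
    if np.ndim(result) == 0:        # a scalar came back
        return "reducing (N → 1)"
    if len(result) == len(data):    # one output value per input value
        return "non-reducing (N → N)"
    return "neither (N → M)"

labels = {
    "sum": describe_shape(Series.sum, x),
    "cumsum": describe_shape(Series.cumsum, x),
    "head(2)": describe_shape(lambda s: s.head(2), x),
}
print(labels)
```

Note that `cumsum` lands in the non-reducing bucket even though each output depends on more than one input: what matters for groupby purposes is the shape of the result.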

Groupby Operations

In pandas, `groupby` operations work on a “split → apply → combine” basis. This means that the data is first split into groups, some function is then applied to each group, and then those resultant pieces are recombined into the final `DataFrame`.
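To make the three steps concrete, here is a minimal sketch that performs them by hand on a small made-up frame (the `key` and `value` column names are placeholders for this example only) and compares the result with the built-in grouped sum:

```python
from pandas import DataFrame, Series

data = DataFrame({"key": ["a", "a", "b", "b", "b"], "value": [1, 2, 3, 4, 5]})

# split: iterating over a groupby yields (group label, sub-frame) pairs
pieces = {label: group["value"] for label, group in data.groupby("key")}

# apply: run a function on each piece independently
applied = {label: piece.sum() for label, piece in pieces.items()}

# combine: stitch the per-group results back into one pandas object
manual = Series(applied, name="value").rename_axis("key")

# the built-in path, for comparison
builtin = data.groupby("key")["value"].sum()
print(manual.to_dict() == builtin.to_dict())
```

This is only an illustration of the idea; pandas does not literally build Python dictionaries internally.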

But if it’s as simple as the operation described above, why does pandas offer three options for performing a `.groupby` operation: `.agg`, `.transform`, and `.apply`?

The reason is speed. By making certain assumptions about the shape of the output of the combine step (e.g., N → N or N → 1), pandas can take some shortcuts to ensure that our operations are applied as quickly as possible. As you may have guessed, the resultant shape is very closely tied to whether the function we want to apply is a reducing or a non-reducing function!

Let’s make some data to demonstrate:

```python
rng = default_rng(0)

df = DataFrame(
    index=(idx := Index(["a", "b", "c"], name='Group').repeat(3)),
    data={
        'values1': [34, 25, 20, 10, 12, 1, 3, 0, 7],
        'values2': [39, 41, 15, 20, 25, 31, 10, 9, 4],
    }
)

df
```
| Group | values1 | values2 |
|---|---|---|
| a | 34 | 39 |
| a | 25 | 41 |
| a | 20 | 15 |
| b | 10 | 20 |
| b | 12 | 25 |
| b | 1 | 31 |
| c | 3 | 10 |
| c | 0 | 9 |
| c | 7 | 4 |

DataFrame.groupby(…).agg

• `.agg` only works with functions that reduce.

• For user-defined functions, `.agg` operates on each column independently. This means that you cannot operate across columns with `.agg`.

When working with multiple columns, as in our data set, pandas will apply the function to each column within each group separately. This means that a function passed to `.groupby(…).agg()` cannot combine values from multiple columns.
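A quick way to see this per-column behavior is to record what the function actually receives. The sketch below uses a made-up reduction called `spread` (max minus min) and a `seen` list, both introduced here purely for illustration, on the same data as above:

```python
from pandas import DataFrame, Index

df = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

seen = []  # names of the objects handed to our function

def spread(s):
    """Illustrative reduction: max minus min of one column within one group."""
    seen.append(s.name)
    return s.max() - s.min()

result = df.groupby("Group").agg(spread)
print(result)
print(sorted(set(seen)))  # the function only ever sees one column at a time
```

Because `spread` is handed a single `Series` on each call, there is no way for it to reach into the other column.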

There are many ways to pass functions into `.agg`. One is to pass multiple reduction functions as a list. Note that I receive an output for each of my original columns. To read about the other ways you can pass functions to `.agg`, check out the `pandas.core.groupby.DataFrameGroupBy.aggregate` documentation.

```python
df.groupby('Group').agg(['sum', 'mean', lambda s: False])
```
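| Group | values1 sum | values1 mean | values1 `<lambda_0>` | values2 sum | values2 mean | values2 `<lambda_0>` |
|---|---|---|---|---|---|---|
| a | 79 | 26.333333 | False | 95 | 31.666667 | False |
| b | 23 | 7.666667 | False | 76 | 25.333333 | False |
| c | 10 | 3.333333 | False | 23 | 7.666667 | False |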
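Beyond a list, two other calling styles are worth knowing: a dictionary that maps column names to functions, and keyword-based "named aggregation". The sketch below shows both on the same data; the output names `total1` and `mean2` are arbitrary labels chosen for this example:

```python
from pandas import DataFrame, Index

df = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

# dictionary: a different reduction for each column
by_column = df.groupby("Group").agg({"values1": "sum", "values2": "mean"})

# named aggregation: output_name=(input_column, function)
named = df.groupby("Group").agg(
    total1=("values1", "sum"),
    mean2=("values2", "mean"),
)

print(by_column)
print(named)
```

Both forms still reduce one column at a time; they only change which function is paired with which column and how the output is labeled.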

If we pass a user-defined function that does not reduce, we see this error:

```python
df.groupby('Group')['values1'].agg(lambda s: s * 2)
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df.groupby('Group')['values1'].agg(lambda s: s * 2)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/generic.py:294, in SeriesGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
291     return self._python_agg_general(func, *args, **kwargs)
293 try:
--> 294     return self._python_agg_general(func, *args, **kwargs)
295 except KeyError:
296     # KeyError raised in test_groupby.test_basic is bc the func does
297     #  a dictionary lookup on group.name, but group name is not
298     #  pinned in _python_agg_general, only in _aggregate_named
299     result = self._aggregate_named(func, *args, **kwargs)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/generic.py:327, in SeriesGroupBy._python_agg_general(self, func, *args, **kwargs)
324 f = lambda x: func(x, *args, **kwargs)
326 obj = self._obj_with_exclusions
--> 327 result = self._grouper.agg_series(obj, f)
328 res = obj._constructor(result, name=obj.name)
329 return self._wrap_aggregated_output(res)

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:864, in BaseGrouper.agg_series(self, obj, func, preserve_dtype)
857 if not isinstance(obj._values, np.ndarray):
858     # we can preserve a little bit more aggressively with EA dtype
859     #  because maybe_cast_pointwise_result will do a try/except
860     #  with _from_sequence.  NB we are assuming here that _from_sequence
861     #  is sufficiently strict that it casts appropriately.
862     preserve_dtype = True
--> 864 result = self._aggregate_series_pure_python(obj, func)
866 npvalues = lib.maybe_convert_objects(result, try_float=False)
867 if preserve_dtype:

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:890, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
886 res = extract_result(res)
888 if not initialized:
889     # We only do this validation on the first iteration
--> 890     check_result_array(res, group.dtype)
891     initialized = True
893 result[i] = res

File ~/.pyenv/versions/dutc-site/lib/python3.10/site-packages/pandas/core/groupby/ops.py:88, in check_result_array(obj, dtype)
84 if isinstance(obj, np.ndarray):
85     if dtype != object:
86         # If it is object dtype, the function can be a reduction/aggregation
87         #  and still return an ndarray e.g. test_agg_over_numpy_arrays
---> 88         raise ValueError("Must produce aggregated value")

ValueError: Must produce aggregated value
```
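For contrast, the same non-reducing lambda is a perfectly valid argument to `.transform`, which expects an N → N result. A minimal sketch on the same data:

```python
from pandas import DataFrame, Index

df = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

# the function returns one value per input row, so transform accepts it
doubled = df.groupby("Group")["values1"].transform(lambda s: s * 2)
print(doubled)
```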

DataFrame.groupby(…).transform

`.transform` works with both non-reducing and reducing functions.

A groupby-transform operation ensures that the `DataFrame` does not change shape after the operation. The original `Index` remains intact and is used to index the resultant `DataFrame`.

Non-reducing function

```python
df.groupby('Group').transform(lambda s: s.cumsum())
```
| Group | values1 | values2 |
|---|---|---|
| a | 34 | 39 |
| a | 59 | 80 |
| a | 79 | 95 |
| b | 10 | 20 |
| b | 22 | 45 |
| b | 23 | 76 |
| c | 3 | 10 |
| c | 3 | 19 |
| c | 10 | 23 |

Reducing function

When a reducing function is used, the single value produced for each group is broadcast back to the shape of the original input, which ensures an N → N grouped operation.

```python
df.groupby('Group').transform('sum')
```
| Group | values1 | values2 |
|---|---|---|
| a | 79 | 95 |
| a | 79 | 95 |
| a | 79 | 95 |
| b | 23 | 76 |
| b | 23 | 76 |
| b | 23 | 76 |
| c | 10 | 23 |
| c | 10 | 23 |
| c | 10 | 23 |
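Because the broadcast result lines up row-for-row with the original data, a reducing `.transform` is handy for within-group calculations. As one possible use, the sketch below computes each row's share of its group total; the names `group_total` and `shares` are just labels for this example:

```python
from pandas import DataFrame, Index

df = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

# one total per group, repeated for every row of that group
group_total = df.groupby("Group")["values1"].transform("sum")

# same length and index as df, so the division is row-for-row
shares = df["values1"] / group_total
print(shares)
```

Doing the same with `.agg` would need an extra step to map the per-group totals back onto the rows.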

DataFrame.groupby(…).apply

`groupby(…).apply` is the catch-all for pandas groupby operations. For user-defined functions, `.apply` works as follows:

• N → M transformations (including N → 1 and N → N)

• It works with entire sub-frames, instead of on a per column per group basis like `.agg` and `.transform`

The actual mechanics of `.apply` are similar to the above, except that pandas cannot make the same assumptions about the output shape as it can with the `.agg` and `.transform` methods. This allows maximum flexibility for passing user-defined functions to a `groupby` operation, at the cost of some performance.
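Two cases where `.apply` earns its keep are sketched below: a calculation that needs two columns at once, and an N → M result that is neither a single value nor the original length. The `weighted_mean` function and the choice of "top two rows" are illustrative assumptions, not something prescribed by pandas:

```python
from pandas import DataFrame, Index

df = DataFrame(
    index=Index(["a", "b", "c"], name="Group").repeat(3),
    data={
        "values1": [34, 25, 20, 10, 12, 1, 3, 0, 7],
        "values2": [39, 41, 15, 20, 25, 31, 10, 9, 4],
    },
)

def weighted_mean(group):
    """Mean of values1 weighted by values2, computed on one whole sub-frame."""
    return (group["values1"] * group["values2"]).sum() / group["values2"].sum()

# across columns: the function receives a DataFrame per group, not a Series
weighted = df.groupby("Group").apply(weighted_mean)

# N → M: keep the two largest values1 rows of each group (3 rows in, 2 rows out)
top2 = df.groupby("Group", group_keys=False).apply(
    lambda g: g.nlargest(2, "values1")
)

print(weighted)
print(top2)
```

If a calculation can be phrased as a per-column reduction or transformation instead, `.agg` or `.transform` is usually the better first choice.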

Wrap Up

That takes us to the end of this post! This should help orient you towards how you can better leverage `.agg` and `.transform` in your work, and only use `.apply` when you really need it.

In summary, you can think of groupby operations with the following table:

| Method | Accepts | Output shape | Operates on |
|---|---|---|---|
| `.agg` | reducing functions only | N → 1 per group | each column within each group |
| `.transform` | reducing and non-reducing functions | N → N (original `Index` kept) | each column within each group |
| `.apply` | any function | N → M | the entire sub-frame of each group |

Talk to you all next time!