When do I Write a Function?
Hey all, this week I wanted to visit a topic that comes up across many of the courses that we teach:
When do I write a function?
On the surface this question seems fairly straightforward, and, if we adhere to the “DRY” principle (don’t repeat yourself), it’s tempting to conclude that we want to write a function (or loop) any time we have repetition in our code.
This means we would want to take something like this:
from pandas import read_csv, NA

df1 = (
    read_csv('data_01.csv')
    .rename(columns=str.lower)
    .replace(999, NA)
    .fillna(0)
    .groupby('entity').sum()
)

df2 = (
    read_csv('data_02.csv')
    .rename(columns=str.lower)
    .replace(999, NA)
    .fillna(0)
    .groupby('entity').sum()
)
And turn it into this:
def preprocess(path):
    return (
        read_csv(path)
        .rename(columns=str.lower)
        .replace(999, NA)
        .fillna(0)
        .groupby('entity').sum()
    )

df1 = preprocess('data_01.csv')
df2 = preprocess('data_02.csv')
Which is much shorter, right?
Well, we also need to consider the nature of the problem at hand here. Are these two data files ALWAYS going to need the same exact preprocessing steps? It seems safe to assume that two different sources will each need different treatments. What if we need to replace the value 'NAN' in data_02.csv with an actual null value NA, but data_01.csv does not require that treatment?
We would need to begin introducing new modalities into our function:
def preprocess(path, na_value=999):
    return (
        read_csv(path)
        .rename(columns=str.lower)
        .replace(na_value, NA)
        .fillna(0)
        .groupby('entity').sum()
    )

df1 = preprocess('data_01.csv')                  # default sentinel, 999
df2 = preprocess('data_02.csv', na_value='NAN')  # file-specific sentinel
This is a minor change for a fairly minor problem, but what if the change I need to make applies to one file but not another? For example, now I need to combine two columns in df2 but not in df1 BEFORE the .groupby(…).sum(). Do I return from this preprocessing function early? But then I would need to write .groupby(…).sum() TWICE, and that’s just too much. We need to introduce another modality.
def preprocess(path, na_value=999):
    df = (
        read_csv(path)
        .rename(columns=str.lower)
        .replace(na_value, NA)
        .fillna(0)
    )
    if path == 'data_02.csv':
        df['agg'] = df['col1'] * df['col2']
    return df.groupby('entity').sum()
But that doesn’t seem right either. Should I be watching for a single file name inside of a function? Surely that won’t generalize well. Maybe I need a flag here to specify if I want the aggregate column.
def preprocess(path, na_value=999, to_agg=None):
    df = (
        read_csv(path)
        .rename(columns=str.lower)
        .replace(na_value, NA)
        .fillna(0)
    )
    if to_agg is not None:
        # to_agg is a list of column names; multiply them together row-wise
        df['agg'] = df[to_agg].prod(axis='columns')
    return df.groupby('entity').sum()
Yet another modality! Every time I want to ‘intervene’ and do something differently to one file but not another, I need to add another argument to this function, thus increasing its testing space and complexity.
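To make the growth in ‘testing space’ concrete, here is a rough sketch that counts the argument combinations a test suite would have to consider. The candidate values are assumptions made up for illustration, and `drop_empty` is a hypothetical third flag that does not appear in the function above.

```python
from itertools import product

# Hypothetical values each keyword might take across call sites.
options = {
    "na_value": [999, "NAN"],
    "to_agg": [None, ["col1", "col2"]],
}

# Every combination of arguments is a distinct behavior to test.
combinations = list(product(*options.values()))
print(len(combinations))

# Adding one more two-valued flag (hypothetical) doubles the count.
options["drop_empty"] = [False, True]
more_combinations = list(product(*options.values()))
print(len(more_combinations))
```

Each new modality multiplies the number of cases rather than adding to it, which is why the function becomes harder to reason about so quickly.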
Instead, we should rewind and ask ourselves whether these two data files had intentional repetition, or coincidental repetition. The former implies that these two data files should be preprocessed in the exact same manner and that this assumption will hold over time.
However, in this case we have stumbled onto coincidental repetition: these two files just happened to have the same preprocessing steps at this point in time, but we do not expect this to remain true in the future.
We can use this concept of intentionality to guide us on ‘when to write a function.’
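If the repetition is coincidental, one option is simply to keep one explicit pipeline per file and let each diverge where it needs to. Here is a minimal sketch of that, assuming small in-memory stand-ins for the two CSV files (the data and the column names `col1` and `col2` are made up for illustration):

```python
from io import StringIO
from pandas import read_csv, NA

# Hypothetical stand-ins for data_01.csv and data_02.csv.
data_01 = StringIO("Entity,Col1,Col2\na,1,999\na,2,3\nb,4,5\n")
data_02 = StringIO("Entity,Col1,Col2\na,1,2\na,999,3\nb,4,5\n")

# Pipeline for the first file: unchanged.
df1 = (
    read_csv(data_01)
    .rename(columns=str.lower)
    .replace(999, NA)
    .fillna(0)
    .groupby('entity').sum()
)

# Pipeline for the second file: free to grow its own extra step.
df2 = (
    read_csv(data_02)
    .rename(columns=str.lower)
    .replace(999, NA)
    .fillna(0)
    .assign(agg=lambda d: d['col1'] * d['col2'])
    .groupby('entity').sum()
)
```

There is some visual repetition here, but changing how the second file is handled cannot break the first, and no flags are needed.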
A Note From James Powell
At this point, James Powell had something to say on the subject. Here are a few notes he made:
We can consider writing functions to be a ‘normalization’ process for computations. In the first draft of our code, there will be repeated computational steps, and where this repetition is intentional, we have the risk of ‘update anomaly.’
An update anomaly is often the result of a ‘fact’ being represented in multiple, disconnected places. For example, if a government department stores citizens’ addresses in two different databases, one for income taxation and one for property taxation, then, when people move, we have to make sure that their addresses are updated in both places. If these two databases are not connected through some mechanism—if someone has to manually perform the updates in both places—then there is risk of an anomaly, where one database shows the correct ‘fact’ and the other does not. The typical solution to this is to establish a ‘single source of truth.’
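As a toy illustration of the address example (the records and names here are invented, not taken from any real system), two disconnected copies of a fact can drift apart, while a single shared copy cannot:

```python
# Two disconnected copies of the same fact (hypothetical records).
income_tax_db = {"alice": "1 Old Street"}
property_tax_db = {"alice": "1 Old Street"}

# Alice moves, but only one database gets updated by hand.
income_tax_db["alice"] = "2 New Avenue"
anomaly = income_tax_db["alice"] != property_tax_db["alice"]

# Single source of truth: both departments look the address up in one place.
addresses = {"alice": "1 Old Street"}

def income_tax_address(citizen):
    return addresses[citizen]

def property_tax_address(citizen):
    return addresses[citizen]

# One update is now visible to both consumers.
addresses["alice"] = "2 New Avenue"
consistent = income_tax_address("alice") == property_tax_address("alice")
```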
In a similar sense, if our code contains the same computational steps repeated in multiple places, then this repetition could be ‘coïncidental’ or it could be ‘intentional.’ If it is intentional, then that means that, if one instance of this code changes, all of them must change in tandem, or else there will be anomalies. If it is coïncidental, then the various instances may be expected to diverge over time—that they were originally identical does not mean we should enforce that they must always be identical.
In the case of ‘intentional’ repetition, we want to write a function. A function provides us with a ‘single source of truth’ for how to perform some grouping of computational steps. If something about this computation changes, the change is made to the function and all call-sites for that function correctly update in tandem.
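A small sketch of intentional repetition, using a made-up business rule: suppose invoices and quotes must always apply the same tax. The names `TAX_PERCENT`, `add_tax`, `invoice_total`, and `quote_total` are hypothetical.

```python
TAX_PERCENT = 20  # hypothetical business rule shared by every call site

def add_tax(amount_cents):
    """The single place where the tax rule is written down."""
    return amount_cents * (100 + TAX_PERCENT) // 100

def invoice_total(line_items_cents):
    return add_tax(sum(line_items_cents))

def quote_total(line_items_cents):
    return add_tax(sum(line_items_cents))

items = [1000, 2000]
print(invoice_total(items), quote_total(items))
```

If the rule changes, only `add_tax` is edited, and the invoice and the quote cannot disagree.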
In the case of ‘coïncidental’ repetition, if we write a function, then, when one call-site of that function changes, we will have to update the function to accommodate this change without disrupting the other call-sites. This typically means introducing modalities with defaults. It’s often the case that writing functions to eliminate coïncidental repetition leads to functions that take a sprawling number of additional keyword arguments and whose bodies become an intractable mess of branching logic.
The distinction between ‘intentional’ and ‘coïncidental’ repetition is not necessarily obvious, but there are some common cases where the repetition is likely to be ‘coïncidental.’ We can ask ourselves ‘how might the logic supported by this computation change over time; will this introduce modalities; and will all users uniformly be aware of, care about, and make equally distributed choices regarding these modalities’? For example, in code that interacts with the ‘outside’ world by reading and parsing a data file, we may anticipate that the data format may change. Will this change be something that will happen to all users? Will all users want to make choices regarding this change? (In a non-pure language, this thinking also provides us with motivation for ‘superficializing’ out interactions with external state.)
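One possible reading of that last remark, sketched under my own assumptions rather than as James's definition: keep the steps that touch external state (here, opening a file) at the outer surface of the program, and keep the computation underneath free of I/O. All function names and the CSV layout below are hypothetical.

```python
import csv
from io import StringIO

def parse_rows(lines):
    """Pure step: turn CSV text lines into (entity, value) pairs."""
    reader = csv.DictReader(lines)
    return [(row["entity"], int(row["value"])) for row in reader]

def total_by_entity(rows):
    """Pure step: sum the values for each entity."""
    totals = {}
    for entity, value in rows:
        totals[entity] = totals.get(entity, 0) + value
    return totals

def main(path):
    # The only place that touches the file system.
    with open(path, newline="") as f:
        return total_by_entity(parse_rows(f))

# The pure steps can be exercised without any file on disk.
sample = StringIO("entity,value\na,1\na,2\nb,4\n")
totals = total_by_entity(parse_rows(sample))
```

If the file format changes, only the thin outer layer and the parser need attention, and callers who read their data from somewhere else can reuse the inner steps as they are.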
Wrap-Up
Thanks for tuning in this week! I hope you all take this to heart and think a little more critically about the code you write. I have found that sticking to principles without contemplating them leads to inappropriate application. Make your own guidelines, justify them, and apply them.
Until next time!