Unwinding Concepts: defaultdict

Welcome back to Cameron's Corner! This week, the example we’re exploring is something I discussed in a recent class, where we covered some objects one will find in Python's collections module.

This particular discussion arose when we got to collections.defaultdict. While many know it as a convenient tool for handling missing keys, there’s a lot more to it than meets the eye. With this blog post, I am not aiming to simply tell you how the defaultdict works. Instead, I want to show you the concepts that build up to making its design nearly obvious. In our classes, this is how I would teach you the defaultdict: not just by showing you its behaviors, but by layering the concepts that explain them.

Teaching Python Through Depth

When I teach Python, I aim for more than just conveying what tools like defaultdict do. I intend to show not just what Python offers but demonstrate important concepts and develop an understanding as to why they exist. This way, you don’t just learn that defaultdict exists—you gain an understanding of its inner workings and all of the concepts that support how it works. From there, you can uncover features that make Python an incredibly powerful programming language, one with much more nuance than you might guess from a surface-level view.

I like using the collections.defaultdict as an example. At first glance, it’s just a dictionary with default values—a handy convenience. But when you explore its implementation, you find Python’s deeper strengths:

Basic Python data structures: Gaining a deeper appreciation of the dictionary and how it serves as a foundation for many Python idioms and solutions.
Functions as First-Class Objects: defaultdict doesn’t just take a value as a default, it takes a function. This opens up a large amount of flexibility, letting you generate dynamic defaults on demand. Understanding this requires appreciating the distinction between a function object and a function call, a concept at the heart of Python's design.
Python’s Data Model: Diving into how defaultdict works leads you to Python’s data model—methods like __getitem__ and __setitem__ that let objects behave like dictionaries. Once you grasp these protocols, you can build your own custom objects that work seamlessly with Python’s syntax.

A brief aside. Notice that the concepts above are incremental in complexity. Depending on the student I am working with, I may not mention the finer points and may just keep the topic at a more basic level. Understanding basic Python data structures helps us address why a KeyError might occur in the first place. Treating functions as objects leads to a strong understanding of how the default value is created and can lead to other advanced concepts like decorators. Then, finally, there is Python's data model. This is something I would discuss with more advanced students, as this level of detail can often overwhelm intermediate Python users.

When you understand tools like defaultdict at this level, you’re not just adding another item to your programming toolbox, you’re building an intuition for Python itself. You see how its principles—like explicitness, readability, and flexibility—interconnect to create a language that feels simple yet powerful.

By the end of this journey, you won’t just know how to use defaultdict; you’ll have a deeper appreciation for why it works the way it does. And, perhaps most importantly, you’ll be equipped to discover Python’s other nuances on your own, with the curiosity and confidence to go beyond the surface. That’s the kind of understanding I aim to foster in every Python concept I teach.

Problem: Sorting Values Into Dictionaries

Imagine you’re working with a collection of filenames, each with a different extension, like .txt, .csv, or .png. The goal is to organize these files into groups based on their extensions, so all .txt files end up in one group, .csv files in another, and so on. This kind of categorization is especially useful when managing large datasets or preparing files for specific workflows. While the task itself seems straightforward, implementing it in Python requires understanding how to work with some basic Python objects.

from random import Random

rnd = Random(0)

extensions = ['.txt', '.md', '.html', '.csv', '.png']
files = [f"file{i}{rnd.choice(extensions)}" for i in range(10)]

d = {}
for f in files:
    suffix = f[f.rfind('.'):]
    d[suffix].append(f)

d

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 11
      9 for f in files:
     10     suffix = f[f.rfind('.'):]
---> 11     d[suffix].append(f)
     13 d

KeyError: '.csv'

The error is glaring: KeyError. The dictionary d has no idea what to do when you ask it for a missing key. Enter the classic workaround: explicitly create default values.

d = {'.txt': [], '.md': [], '.html': [], '.csv': [], '.png': []}
for f in files:
    suffix = f[f.rfind('.'):]
    d[suffix].append(f)

d

{'.txt': ['file2.txt'],
 '.md': [],
 '.html': ['file3.html', 'file7.html', 'file9.html'],
 '.csv': ['file0.csv', 'file1.csv', 'file5.csv', 'file6.csv', 'file8.csv'],
 '.png': ['file4.png']}

This works fine when you know all the possible keys in advance. But what if you don’t? Then you end up writing a slightly more flexible—but verbose—solution:

d = {}
for f in files:
    suffix = f[f.rfind('.'):]
    if suffix not in d: # check if the key does not exist in our dictionary
        d[suffix] = []  # create a default value
    d[suffix].append(f)

d

{'.csv': ['file0.csv', 'file1.csv', 'file5.csv', 'file6.csv', 'file8.csv'],
 '.txt': ['file2.txt'],
 '.html': ['file3.html', 'file7.html', 'file9.html'],
 '.png': ['file4.png']}

Now we’re checking for missing keys explicitly and adding them on the fly. This is fine, but honestly, it’s a bit of a slog. Wouldn’t it be nice if Python handled this for us? Enter the collections.defaultdict.

collections.defaultdict

The defaultdict comes to us from the collections module, and helps us enact some pre-defined behavior whenever our program encounters a missing key. Here’s the same problem solved with a defaultdict:

from collections import defaultdict

d = defaultdict(list)
for f in files:
    suffix = f[f.rfind('.'):]
    d[suffix].append(f)

d

defaultdict(list,
            {'.csv': ['file0.csv',
              'file1.csv',
              'file5.csv',
              'file6.csv',
              'file8.csv'],
             '.txt': ['file2.txt'],
             '.html': ['file3.html', 'file7.html', 'file9.html'],
             '.png': ['file4.png']})

No more manual key checks. No more boilerplate code. Just a dictionary that knows what to do when you ask it for a key that doesn’t exist.
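The same pattern is not limited to lists. As a small sketch (the sample_files list below is made up for illustration and is not the randomly generated list from earlier), passing int gives every new key a starting value of 0, which turns the same loop into a counter:

```python
from collections import defaultdict

# Hypothetical sample data, only for this illustration
sample_files = ["a.txt", "b.csv", "c.txt", "d.png", "e.csv", "f.txt"]

counts = defaultdict(int)  # int() returns 0, so every new key starts at 0
for name in sample_files:
    suffix = name[name.rfind('.'):]
    counts[suffix] += 1    # no KeyError on the first occurrence of a suffix

print(dict(counts))  # {'.txt': 3, '.csv': 2, '.png': 1}
```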

But wait, how does it work? What exactly is defaultdict doing under the hood? Let’s break it down.

But How Does This Work?

To understand defaultdict, we first need to take a step back and talk about how functions work in Python. Specifically, the distinction between calling a function and working with a function object.

Function calls vs function objects

Here’s a simple function, and despite being called complex_math, it does not actually perform any math operations:

def complex_math():
    return 'called complex_math!'

complex_math()

'called complex_math!'

When you call complex_math(), Python runs the code inside the function and gives you the result. But the function itself—without parentheses—is just an object, like any other object in Python:

print(
    f"{complex_math() = }", # complex_math was called, therefore we return "called complex_math"
    f"{complex_math   = }", # prints the underlying function *object*, complex_math is NOT executed here
    sep="\n"
)

complex_math() = 'called complex_math!'
complex_math   = <function complex_math at 0x773d8768c7c0>

This distinction is crucial. The parentheses mean “call this function.” Without them, you’re just referencing the function object.
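To make the difference a little more tangible, here is a minimal sketch that inspects the function object without ever needing to call it, and then inspects the result of a call. It only uses built-ins (type, callable) and the function's own __name__ attribute:

```python
def complex_math():
    return 'called complex_math!'

# The function object exists whether or not we ever call it
print(type(complex_math))      # <class 'function'>
print(complex_math.__name__)   # complex_math
print(callable(complex_math))  # True

# Only the parentheses run the body and produce the return value
result = complex_math()
print(type(result))            # <class 'str'>
```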

Functions Are Objects Too

Callability is a property of functions in the same way that addition is a property of numbers. Here’s an example:

x = 10
y = x

x + y

This works because numbers understand the + operator. But if you try to “call” a number, Python doesn’t know what you mean:

x()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 x()

TypeError: 'int' object is not callable

Instead, you get a familiar error: 'int' object is not callable.
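If you want to check whether something supports being called before you try it, the built-in callable will tell you. The sketch below also shows that functions are not the only callable objects: classes such as list are callable, and so are instances of any class that defines __call__. The Greeter class is a hypothetical example of mine, used only for illustration.

```python
class Greeter:
    """Hypothetical class, used only to illustrate __call__."""
    def __call__(self):
        return 'called a Greeter instance!'

g = Greeter()

print(callable(10))    # False: ints do not support being called
print(callable(len))   # True: a built-in function
print(callable(list))  # True: classes are callable, calling one builds an instance
print(callable(g))     # True: Greeter defines __call__
print(g())             # called a Greeter instance!
```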

Functions Are First-Class Citizens

Functions are objects, which means you can assign them to variables, store them in data structures, and even pass them around as arguments. For example:

cm = complex_math # creates new variable referencing the same underlying function

print(
    f"{complex_math() = }", # can call via the original variable name
    f"{cm()           = }", # can also call through the new variable
    sep='\n',
)

complex_math() = 'called complex_math!'
cm()           = 'called complex_math!'

You can also put functions in a list and iterate over them:

funcs_to_call = [complex_math, complex_math, complex_math]

for func in funcs_to_call:
    print(f"{func() = }")

func() = 'called complex_math!'
func() = 'called complex_math!'
func() = 'called complex_math!'

Or pass them as arguments to other functions:

def complex_math():
    return 'called complex_math!'

def takes_a_func_and_does_nothing(func):
    return func

def takes_a_func_and_calls_it(func):
    return func()

print(
    f"{takes_a_func_and_does_nothing(complex_math) = }",
    f"{takes_a_func_and_calls_it(complex_math)     = }",
    sep="\n"
)

takes_a_func_and_does_nothing(complex_math) = <function complex_math at 0x773d8768e2a0>
takes_a_func_and_calls_it(complex_math)     = 'called complex_math!'
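Passing a function as an argument lets the receiving code decide if and when to call it. Here is a small sketch of that idea applied to our missing-key problem. The helper get_or_create and the filenames are my own inventions for illustration, not part of any library:

```python
def get_or_create(mapping, key, factory):
    """Return mapping[key]; if the key is missing, call factory() to create it first."""
    if key not in mapping:
        mapping[key] = factory()  # the call happens here, and only when needed
    return mapping[key]

groups = {}
get_or_create(groups, '.txt', list).append('notes.txt')
get_or_create(groups, '.txt', list).append('todo.txt')  # key exists, list is not called
get_or_create(groups, '.csv', list).append('data.csv')

print(groups)  # {'.txt': ['notes.txt', 'todo.txt'], '.csv': ['data.csv']}
```

Notice that we pass list, not list(). The helper holds on to the function object and only calls it when a key turns out to be missing.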

Back to defaultdict

So why does this matter? defaultdict takes a function (not a function call!) as its argument. When you access a missing key, defaultdict calls that function to create a default value:

def create_value():
    return 42

d = defaultdict(create_value)
d

defaultdict(<function __main__.create_value()>, {})

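Accessing a missing key no longer raises a KeyError. Instead, defaultdict calls create_value and stores the result under that key:

d['a'] # Instead of KeyError, call `create_value`
d

defaultdict(<function __main__.create_value()>, {'a': 42})

If your default value is something simple, like an empty list, you can just pass list itself (without parentheses):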

from collections import defaultdict

d = defaultdict(list)
d['a'] # Instead of raising a KeyError, call `list`
d

defaultdict(list, {'a': []})
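This is also where the function object vs. function call distinction pays off. The sketch below shows what happens if you pass the result of a call rather than something callable, and one common way to get a constant default: wrapping the value in a small function (here a lambda). The 'unknown' value is just a placeholder I picked for the example.

```python
from collections import defaultdict

# Passing the *result* of a call fails: list() is an empty list, not a callable
try:
    defaultdict(list())
except TypeError as exc:
    print(f"TypeError: {exc}")

# For a constant default, wrap the value in a function
d = defaultdict(lambda: 'unknown')
print(d['missing'])  # unknown
print(dict(d))       # {'missing': 'unknown'}
```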

And that’s the magic of defaultdict. It turns a lot of repetitive boilerplate code into a single, elegant solution.

from collections import defaultdict

d = defaultdict(list)
for f in files:
    suffix = f[f.rfind('.'):]
    d[suffix].append(f) # If d[suffix] is missing, call list(), store the new empty list, and append to it
                        # If d[suffix] is not missing, append to the existing list

d

defaultdict(list,
            {'.csv': ['file0.csv',
              'file1.csv',
              'file5.csv',
              'file6.csv',
              'file8.csv'],
             '.txt': ['file2.txt'],
             '.html': ['file3.html', 'file7.html', 'file9.html'],
             '.png': ['file4.png']})

Of course, the above point may already be familiar to some who have been using Python for a little while, but the discussion about defaultdict does not need to stop here. If we want to go even deeper, we can talk about how Python knows what to do when you use simple dictionary[...] syntax. This is when we need to introduce Python’s data model methods.

Going a Bit Deeper

How does defaultdict actually work? Is it some kind of Python magic? Not quite. Under the hood, it’s using Python’s built-in dictionary protocols, just with a twist. Let’s build a simplified version of defaultdict to understand which data model methods are invoked when we use our dictionary lookup and assignment syntax.

First, a basic dictionary-like class:

class T:
    def __init__(self):
        self.d = {} # we will proxy a regular dictionary

    def __getitem__(self, key):
        print(f'retrieving value paired with {key!r}')
        return self.d[key]
    
    def __setitem__(self, key, value):
        print(f'setting {key!r}={value!r}')
        self.d[key] = value

t = T()

print(t.d)  # empty dictionary
t['a'] = 10 # t.__setitem__('a', 10)
print(t.d)  # {'a': 10}
t['a']      # t.__getitem__('a')

{}
setting 'a'=10
{'a': 10}
retrieving value paired with 'a'

t['b'] # KeyError!

retrieving value paired with 'b'

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[17], line 1
----> 1 t['b'] # KeyError!

Cell In[16], line 7, in T.__getitem__(self, key)
      5 def __getitem__(self, key):
      6     print(f'retrieving value paired with {key!r}')
----> 7     return self.d[key]

KeyError: 'b'
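The same translation happens for a regular dictionary. As a rough sketch (Python looks these special methods up on the object's type, so the explicit calls below are an approximation of what the bracket syntax does), we can spell out the method calls by hand:

```python
d = {'a': 10}

# d['a'] is roughly equivalent to this explicit method call
print(d['a'])                        # 10
print(type(d).__getitem__(d, 'a'))   # 10

# d['b'] = 20 is roughly equivalent to this one
type(d).__setitem__(d, 'b', 20)
print(d)                             # {'a': 10, 'b': 20}
```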

As you saw above, using square brackets (t['a']) invokes that object's .__getitem__ method, and assigning with that same syntax (t['a'] = 10) invokes the .__setitem__ method. This means that we can freely modify the behavior of these methods, and if we want to recreate the behavior of our defaultdict, we can add a bit more logic inside of our .__getitem__ to handle KeyErrors.

class T:
    def __init__(self, default_factory):
        self.d = {}
        self.default_factory = default_factory

    def __getitem__(self, key):
        print(f'retrieving value paired with {key!r}')
        try:
            result = self.d[key]
        except KeyError:
            print('→ failed to retrieve, falling back to a default value')
            result = self.default_factory()
            self.d[key] = result
        return result
    
    def __setitem__(self, key, value):
        print(f'setting {key!r}={value!r}')
        self.d[key] = value

t = T(int)

print(t.d)  # empty dictionary
t['a'] = 10 # __setitem__
print(t.d)  # {'a': 10}
t['a']      # __getitem__

{}
setting 'a'=10
{'a': 10}
retrieving value paired with 'a'

t['b']

retrieving value paired with 'b'
→ failed to retrieve, falling back to a default value

print(t.d) # The KeyError that we averted now populates an entry in our dictionary!

{'a': 10, 'b': 0}
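To tie this back to the original problem, here is a sketch that uses the same compositional idea to group filenames. MiniDefaultDict is a hypothetical, print-free variant of the T class above, and counting_list is a made-up factory that records how often it runs, so we can confirm the factory is only called for keys that are actually missing:

```python
class MiniDefaultDict:
    """Hypothetical, stripped-down version of the T class above (no print calls)."""
    def __init__(self, default_factory):
        self.d = {}
        self.default_factory = default_factory

    def __getitem__(self, key):
        try:
            return self.d[key]
        except KeyError:
            # create the default, store it, and hand it back
            value = self.d[key] = self.default_factory()
            return value

    def __setitem__(self, key, value):
        self.d[key] = value

calls = []
def counting_list():
    calls.append(1)  # record each time the factory runs
    return []

groups = MiniDefaultDict(counting_list)
for name in ['a.txt', 'b.csv', 'c.txt', 'd.txt']:  # made-up filenames
    suffix = name[name.rfind('.'):]
    groups[suffix].append(name)

print(groups.d)    # {'.txt': ['a.txt', 'c.txt', 'd.txt'], '.csv': ['b.csv']}
print(len(calls))  # 2: the factory ran once per new key, not once per lookup
```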

Even Deeper

In the above example, we used composition to hold onto a dictionary in our self.d attribute and interfaced with that dictionary by linking the T class's .__setitem__ and .__getitem__ methods to those of the underlying dictionary. But we can also accomplish this pattern by using inheritance instead of a compositional approach.

When we do this, we should also be aware that Python dictionaries already try to do something when a lookup fails! When a key is not found, the lookup checks whether the object's class defines a .__missing__ method. This method is intentionally not written on the base dictionary class, so for a plain dictionary there is nothing to call and the KeyError propagates. If a subclass does have a method by that name, Python calls it with the missing key and assumes that it contains the logic for what to do in the case of a missing key, using whatever it returns as the result of the lookup.
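Here is a minimal sketch of that hook in isolation. The Fallback class is a hypothetical example of mine; its .__missing__ returns a placeholder string without storing anything, which makes it easy to see exactly when the hook runs and when it does not:

```python
class Fallback(dict):
    """Hypothetical dict subclass, used only to show when __missing__ runs."""
    def __missing__(self, key):
        return f'<no value for {key!r}>'  # returned, but not stored

plain = {}
fb = Fallback()

try:
    plain['x']
except KeyError as exc:
    print(f'plain dict: KeyError {exc}')  # plain dict: KeyError 'x'

print(fb['x'])      # <no value for 'x'>
print('x' in fb)    # False: this __missing__ did not insert anything
print(fb.get('x'))  # None: .get() does not go through __missing__
```

Only the square bracket lookup triggers .__missing__, which is why the membership test and .get() behave exactly as they would on a regular dictionary.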

This does mean that we can shorten our previous example and let our dictionary subclass use its parent's .__getitem__ and .__setitem__ methods while we implement our custom logic inside of the .__missing__ data model method.

from collections import UserDict

class T(UserDict):
    def __init__(self, default_factory, data=None): # avoid a mutable default argument
        self.default_factory = default_factory
        super().__init__(data) # UserDict copies any initial data into self.data
    def __missing__(self, key):
        value = self.default_factory()
        self[key] = value
        return value

t = T(list)

# dict subclasses call `.__missing__` in the event of a KeyError
# if it is undefined, the error propagates
# since we define it, we can enforce a pre-determined behavior
t['a']

[]

t

{'a': []}
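We can check that the real collections.defaultdict exposes the same pieces we just built by hand. This short sketch pokes at its default_factory attribute and calls its .__missing__ method directly (something you would rarely do in real code, but it is handy for exploration):

```python
from collections import defaultdict

d = defaultdict(list)

print(d.default_factory)   # <class 'list'>
print(d.__missing__('a'))  # []  (calls the factory and stores the result)
print(dict(d))             # {'a': []}

# With no factory, a defaultdict behaves like a regular dict for missing keys
d.default_factory = None
try:
    d['b']
except KeyError as exc:
    print(f'KeyError: {exc}')  # KeyError: 'b'
```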

Wrap-Up

And that's it for our defaultdict! As you can likely tell, this blog post would have been much shorter if I had just shown you the defaultdict, but instead we got around to discussing many key Python concepts. While the defaultdict may help you out in a simple script, being able to envision functions as regular objects and learning about data model methods will let you make simpler choices with your code and write more complex programs in a coherent fashion.

At this point, I am certain that we all have some appreciation for the collections.defaultdict as a small but powerful tool. It eliminates boilerplate, makes your code cleaner, and leverages Python’s first-class functions in a natural way. But, by inspecting how some of these built-in tools work in Python, we can start to appreciate more complicated designs and important concepts.

I would encourage you to use your own curiosity and go discover something in Python! See something and don't know how it works? Read other blog posts, watch YouTube videos, and inspect source code. These explorations will often open up rabbit holes where you can keep learning new ideas. I am willing to bet that you read through the last example above and skimmed right past the collections.UserDict. Why did I use it? Why not just inherit from Python's dict?

Should I discuss that in a future blog post or leave it up to you to do a deep dive and learn something new about Python? Let me know on the DUTC Discord.

Talk to you all next time!