Table of Contents

Python's Collections: Built-ins, But Fancier

Hello, everyone! Welcome back to Cameron's Corner! This week, I want to discuss Python’s collections module. Python's collections module is like a secret menu for built-in container types. You know your basic dict, list, set, and tuple, but what if they had their capabilities slightly expanded? That’s exactly what collections does, offering specialized versions of these types that solve common problems with a little extra flair.

Ever wished a dict could populate missing keys? Meet the collections.defaultdict. Need a list that can be used to represent both a stack and a queue? That’s a collections.deque. Want a set that counts occurrences? That’s the collections.Counter. Each of these tools builds on Python’s core data structures, giving you more functionality without reinventing the wheel.

Let’s take a quick tour of these fancy containers and see why they deserve a spot in your Python toolbox.

Built-in Type

collections Enhancement

Description

dict

defaultdict

A dictionary that provides default values for missing keys.

set

Counter

A multiset that counts occurrences of elements.

list

deque

A list-like container optimized for fast appends and pops on both ends.

tuple

namedtuple

A tuple with named fields for better readability.

defaultdict (a fancy dictionary)

A collections.defaultdict is a specialized dictionary that provides auto-vivification, meaning it automatically creates default values for missing keys instead of raising a KeyError. When a key is accessed but doesn’t exist, defaultdict calls a factory function (e.g., int, list, dict, any arbitrary callable) to generate a default value on the fly.

Don't miss out on my previous blog post where I take a deep dive into the concepts that underlie the defaultdict!

Recall that when we fail a lookup in a regular dictionary we observe a KeyError

d = {}
d['a'] # KeyError!
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 2
      1 d = {}
----> 2 d['a'] # KeyError!

KeyError: 'a'

But, if we use a defaultdict, we pass a function that is called whenever a missing key is encountered. This function call returns a value that is paired with the previously missing key and to the caller as if the KeyError never happened.

from collections import defaultdict

d = defaultdict(int)
d['a'] # Key 'a' does not exist, create entry with value of 0 (result of calling int())

d
defaultdict(int, {'a': 0})

In a practical setting, this allows us to work with a dictionary without ever having to explicitly check for the existence of a particular key.

words = ['apple', 'banana', 'cherry', 'avocado', 'blueberry', 'carrot', 'almond', 'broccoli', 'coconut', 'apricot']

categorized = {}
for w in words:
    first_letter = w[0]
    if first_letter not in categorized: # vivify the value if the key does not exist in the dictionary
        categorized[first_letter] = set()
    categorized[first_letter].add(w)

categorized
{'a': {'almond', 'apple', 'apricot', 'avocado'},
 'b': {'banana', 'blueberry', 'broccoli'},
 'c': {'carrot', 'cherry', 'coconut'}}
from collections import defaultdict

words = ['apple', 'banana', 'cherry', 'avocado', 'blueberry', 'carrot', 'almond', 'broccoli', 'coconut', 'apricot']

categorized = defaultdict(set) # no need to check `if w[0] in categorized`
for w in words:
    categorized[w[0]].add(w)

categorized
defaultdict(set,
            {'a': {'almond', 'apple', 'apricot', 'avocado'},
             'b': {'banana', 'blueberry', 'broccoli'},
             'c': {'carrot', 'cherry', 'coconut'}})

Counter (a fancy set)

A collections.Counter in Python is essentially a multiset: a set that allows duplicate elements and keeps track of their occurrences. Unlike a standard set, which only stores unique elements, a multiset maintains a count of how many times each element appears. Counter implements this behavior by storing elements as dictionary keys and their counts as values. This makes it useful for frequency analysis, histogram generation, and duplicate detection.

One might point out that the Counter employs many dictionary like behaviors: the idea of keys & values and even has common dictionary methods like .get, .pop. In this case I would agree with you that the Counter behaves both like a Python set and a dictionary, but its usage is much more closely aligned with a multiset.

Say we have a list of books borrowed from a library and a list of books returned. A set can help answer some basic questions about which books were involved, but it loses information about how many times each title was borrowed or returned.

borrowed = [
    '1984', 'To Kill a Mockingbird', '1984', 'The Great Gatsby', 
    '1984', 'Moby-Dick', 'Pride and Prejudice', 'The Hobbit'
]

returned = [
    '1984', 'To Kill a Mockingbird', 'The Great Gatsby', 'The Hobbit', 'The Hobbit'
]

print(
    f'{set(borrowed) | set(returned) = }', # all titles
    f'{set(borrowed) & set(returned) = }', # titles that were borrowed and returned
    f'{set(borrowed) - set(returned) = }', # titles that were borrowed but not yet returned
    sep='\n',
)
set(borrowed) | set(returned) = {'Moby-Dick', 'Pride and Prejudice', 'To Kill a Mockingbird', 'The Hobbit', 'The Great Gatsby', '1984'}
set(borrowed) & set(returned) = {'To Kill a Mockingbird', 'The Hobbit', '1984', 'The Great Gatsby'}
set(borrowed) - set(returned) = {'Moby-Dick', 'Pride and Prejudice'}

The use of sets above lets us address many questions quite easily; however, we lose specific information about the quantity of each title. We can address the same questions while preserving this information by using collections.Counter, which acts as a multiset.

from collections import Counter

borrowed = Counter(borrowed)
returned = Counter(returned)

print(
    f'{borrowed + returned = }', # all titles (and their quantity)
    f'{borrowed - returned = }', # titles that were borrowed and subsequently returned
    f'{borrowed - returned = }', # titles that were borrowed but not yet returned
    sep='\n',
)
borrowed + returned = Counter({'1984': 4, 'The Hobbit': 3, 'To Kill a Mockingbird': 2, 'The Great Gatsby': 2, 'Moby-Dick': 1, 'Pride and Prejudice': 1})
borrowed - returned = Counter({'1984': 2, 'Moby-Dick': 1, 'Pride and Prejudice': 1})
borrowed - returned = Counter({'1984': 2, 'Moby-Dick': 1, 'Pride and Prejudice': 1})

Here, Counter allows us to maintain the exact number of times each book was borrowed and returned, giving us richer insights compared to set.

deque (a fancy list)

A collections.deque (double-ended queue) is a versatile data structure that efficiently supports adding and removing elements from both ends. It can function as a stack (LIFO—Last In, First Out), useful for features like browser history, undo functionality, or backtracking algorithms. It also works as a queue (FIFO—First In, First Out), ideal for task scheduling, breadth-first search, or processing requests in order. Compared to a list, deque provides fast append and pop operations from both ends, making it a great choice for either stack or queue-oriented problems.

Stack

A stack follows the Last In, First Out (LIFO) principle. Web browsers use stacks to track browsing history: when you visit a page, it gets pushed onto the stack, and clicking "Back" pops the most recent page off.

A list can be used for this purpose:

history = []

history.append("google.com")
history.append("wikipedia.org")
history.append("github.com")

# click the "Back" button
most_recent_page = history.pop()

print(
    f'{most_recent_page = }',
    f'{history          = }',
    sep='\n'
)
most_recent_page = 'github.com'
history          = ['google.com', 'wikipedia.org']

A deque (double-ended queue) from collections can do the same thing, but it’s optimized for fast appends and pops from both ends:

from collections import deque

history = deque()

history.append("google.com")
history.append("wikipedia.org")
history.append("github.com")

# click the "Back" button
most_recent_page = history.pop()

print(
    f'{most_recent_page = }',
    f'{history          = }',
    sep='\n'
)
most_recent_page = 'github.com'
history          = deque(['google.com', 'wikipedia.org'])

If we needed to just represent a stack, I would stick with a regular list since it works quite well. If your stack can only maintain a maximum size, then a deque passing the maxlen argument would be more appropriate.

queue

A queue follows the First In, First Out (FIFO) principle. For example, customers at a store are served in the order they arrive.

A simple queue can be implemented with a list:

queue = ['Bob', 'Charlie', 'Alice']

queue.append('David') # add customer to queue
current_customer = queue.pop(0) # get first customer

print(
    f'{current_customer = }',
    f'{queue            = }',
    sep='\n',
)
current_customer = 'Bob'
queue            = ['Charlie', 'Alice', 'David']

However, popping from the start of a list (.pop(0)) is inefficient because it requires shifting all elements in the list. Using deque is a more efficient approach:

from collections import deque

queue = deque(['Bob', 'Charlie', 'Alice'])

queue.append('David')         # add customer to queue
queue.appendleft('Elizabeth') # add priority customer to front of queue
current_customer = queue.popleft() # get first customer

print(
    f'{current_customer = }',
    f'{queue            = }',
    sep='\n',
)
current_customer = 'Elizabeth'
queue            = deque(['Bob', 'Charlie', 'Alice', 'David'])

With deque, operations at both ends are O(1), making it ideal for queues where frequent insertions and removals occur.

namedtuple (a fancy tuple)

Regular tuples are often used to store structured data, but they require remembering the meaning of each index, making code less readable and more error-prone. Consider a list of employees stored as tuples:

employees = [
    ("Alice",   30, "Engineering"),
    ("Bob",     25, "Marketing"),
    ("Charlie", 35, "HR"),
    ("David",   28, "Engineering"),
    ("Eve",     40, "Marketing"),
    ("Frank",   32, "HR"),
    ("Grace",   29, "Engineering"),
    ("Hank",    50, "Finance"),
    ("Ivy",     27, "Finance"),
    ("Jack",    45, "Marketing")
]


for emp in employees:
    # code isn’t very readable, we need to memorize what the [2] element of our tuple is
    if emp[2] == 'Marketing':
        print(emp)
('Bob', 25, 'Marketing')
('Eve', 40, 'Marketing')
('Jack', 45, 'Marketing')

Here, emp[2] refers to the department, but this is not obvious when reading the code. Using collections.namedtuple improves clarity:

from collections import namedtuple

Employee = namedtuple('Employee', ['name', 'age', 'department'])

employees = [
    Employee("Alice",   30, "Engineering"),
    Employee("Bob",     25, "Marketing"),
    Employee("Charlie", 35, "HR"),
    Employee("David",   28, "Engineering"),
    Employee("Eve",     40, "Marketing"),
    Employee("Frank",   32, "HR"),
    Employee("Grace",   29, "Engineering"),
    Employee("Hank",    50, "Finance"),
    Employee("Ivy",     27, "Finance"),
    Employee("Jack",    45, "Marketing")
]


for emp in employees:
    # access the field by name (instead of a number/position)
    # much more readable/maintainable
    if emp.department == 'Marketing':
        print(emp)
Employee(name='Bob', age=25, department='Marketing')
Employee(name='Eve', age=40, department='Marketing')
Employee(name='Jack', age=45, department='Marketing')

Using namedtuple makes the intent of the code clear: emp.department is explicit, while emp[2] requires remembering the structure of the tuple.

Wrap-Up

There are a few other honorable mentions in the collections module that build on the dictionary, namely the collections.ChainMap and the collections.OrderedDict. The former allows one to transparently "merge" multiple dictionaries together without overwriting mapping information, while the latter doesn't see much use past Python 3.6 due to the built-in dictionary's preservation of its keys insertion order (though there is one other use case outside of this). I'll leave further exploration of these tools to you!

As you have seen, the collections module in Python provides specialized container types that extend the capabilities of built-in data structures. Whether it's using the Counter for frequency analysis, deque for efficient stack and queue operations, or namedtuple for more readable tuples, these tools help write cleaner, more efficient code that is easier to maintain. By leveraging these specialized containers, you can solve common problems more effectively while keeping your code both expressive and performant without ever needing to reach for a 3rd party library.

What do you think about the collections module—do you love it? Hate it? Let me know on the DUTC Discord server.

Table of Contents
Table of Contents