# Working With Files Deep in Your Code
Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics for experts. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
As you may already know, we frequently train corporate teams on topics such as introduction to Python, advanced Python, API design, data analysis, and much more! Our trainings always involve a custom curriculum, which we tailor to the needs of the team and balance with the expectations of management.
This week, we are holding a course on Python for scripting and automation. At the end of the second day of class, an attendee asked us a question that we’ve gotten before but never fully written up. The attendee asked: “Why shouldn’t I read from a file in my object’s `__init__`?”
In response to this, we had James Powell write up a very thoughtful answer, which he shared on Discord, and I wanted to share it with our blog readers and newsletter subscribers as well. So, here is his response:
It is generally a bad idea to open a file within a class’s `__init__`.
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader

class T:
    def __init__(self, filename):
        self.data = {}
        with open(filename) as f:
            for name, value in reader(f):
                self.data[name] = int(value)

    def __repr__(self):
        return f'T({self.data!r})'

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        filename = Path(p) / 'data.csv'
        with open(filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        obj = T(filename)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
## Premise: Data Model Methods
There are multiple reasons for this. First, our primary motivation for using “data model” methods is to lean into a “common vocabulary” shared among Python programmers.
For example, if we have a collection type, then it’s likely the user will want to know its (non-negative integer) “size.” In the Python vocabulary, this is called `len(obj)` and is implemented by `__len__`. Every collection type in the builtins supports `__len__`, so `len(obj)` works consistently irrespective of whether we have a `list`, a `tuple`, a `dict`, a `set`, a `frozenset`, or a `str`.
If we were to implement our own collection type, say, something similar to a `numpy.ndarray`, `pandas.Series`, or `pandas.DataFrame`, someone would definitely want to know how big these structures are, and their first intuition won’t be to go to the documentation to figure out whether this is an attribute or a method or whether this is called `.size` or `.length` or `.num_elements`. Instead, they’ll write `len(obj)` and expect this to work along exactly the same lines as it would on a `list`.
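To make this “common vocabulary” concrete, here is a minimal sketch of a custom collection that participates in it (the `Bag` class is hypothetical, purely for illustration):

```python
class Bag:
    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        # len(obj) delegates to __len__; it must return a non-negative int
        return len(self.items)

bag = Bag(['a', 'b', 'c'])
print(len(bag))  # 3 — the same vocabulary as list, tuple, dict, …
```

A user of `Bag` never needs to learn a bespoke `.size`/`.length` attribute; `len` just works.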
However, since a `numpy.ndarray`, a `pandas.Series`, and a `pandas.DataFrame` are materially distinct from `list`, the implementation of `__len__` and its exact behavior will differ. There are conventions and rules underlying the behavior of protocols like `__len__` (some of which are not documented and instead are discovered through interaction with the builtins or the Python standard library).
It’s fair that the implementation of these protocols can assume a foundational understanding of the types on which they are implemented. For example, `len(df)` on a `pandas.DataFrame` gives the number of rows, which is (as a consequence of the design of pandas) the most likely question someone would want answered when asking, “How big is this object?”
```python
from pandas import DataFrame

df = DataFrame({
    'x': [1, 2, 3],
    'y': [4, 5, 6],
})

print(
    f'{len(df) = }',          # number of rows
    f'{len(df.index) = }',    # number of rows
    f'{len(df.columns) = }',  # number of columns
    f'{df.shape = }',         # number of (rows, columns)
    f'{df.size = }',          # number of total values (rows×columns)
    sep='\n',
)
```

```
len(df) = 3
len(df.index) = 3
len(df.columns) = 2
df.shape = (3, 2)
df.size = 6
```
In fact, when implementing the “vocabulary” on a custom type, the clarity of the meaning is critical. In the case of a `pandas.DataFrame`, there are at least four possible answers one could give to “How big is this object?”:

- the number of rows
- the number of columns
- the number of (rows, columns)
- the total number of values (rows×columns)
If we had to guess the answer every time we saw `len(df)`, we would not only gain nothing by implementing this “vocabulary,” but we’d make our code less readable and understandable.
If we want to implement methods from the object model so that they are unambiguous, then we’re probably looking for either…

- a unique meaning (e.g., `len([1, 2, 3])` has only one possible meaningful answer)
- a constrained meaning (e.g., `len(obj)` must return a non-negative integer smaller than `sys.maxsize`; therefore, its range of possible meanings is constrained to only interpretations that satisfy these rules)
- a privileged meaning (e.g., `len(DataFrame(...))` shows the “privileged” nature of the rows of the `pandas.DataFrame`)
- a conventional meaning (e.g., `len(obj)` on a very commonly-used data type from a popular library may, over time, develop a meaning that is widely known and widely accepted by its community of users)
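The “constrained meaning” point isn’t merely advisory: CPython actually enforces that `__len__` return a non-negative integer. A quick sketch (the exact error message may vary across Python versions):

```python
class Broken:
    def __len__(self):
        # len() requires a non-negative integer; CPython rejects -1
        return -1

try:
    len(Broken())
except ValueError as e:
    print(f'{e = }')
```

Returning a non-integer from `__len__` is similarly rejected, with a `TypeError`.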
## `__init__` as an unambiguous protocol
Consider, then, that `__init__` represents the protocol by which we initialize an object. While the meaning of `__init__` may be obvious, it raises follow-up questions: initialize this object… from what? In other words, what arguments do I have to pass in?
If our object is designed well, then the answer to these questions should be immediately obvious. An `__init__` that allows loading from a filename is likely to be ambiguous because there is usually at least one other option for how to initialize a (largely stateless, primarily data) object: by passing in its constituent data values.
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader

class T:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'T({self.data!r})'

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        filename = Path(p) / 'data.csv'
        with open(filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        data = {}
        with open(filename) as f:
            for name, value in reader(f):
                data[name] = int(value)

        obj = T(data)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
The `__init__` here doesn’t do anything interesting. (In fact, we may be encouraged to use a `dataclasses.dataclass` to avoid this boilerplate!)
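For instance, a minimal sketch of the same class as a `dataclasses.dataclass`, assuming we’re happy to trade the hand-written `__repr__` for the generated one:

```python
from dataclasses import dataclass, field

@dataclass
class T:
    # the generated __init__ simply assigns the argument to self.data
    data: dict = field(default_factory=dict)

obj = T({'abc': 123, 'def': 456, 'xyz': 789})
print(obj)  # T(data={'abc': 123, 'def': 456, 'xyz': 789})
```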
However, despite it being “boring,” it’s important that we provide this way to initialize this object; otherwise, our code will become very clumsy.
For example, if we need to write a test for this object, do we want to have to create a file on disk for that test to run? If something like `hypothesis` is going to parameterize the test, do we want to have some fixture that creates a file and then have our `__init__` read from that file? This is very clumsy and is likely to test a lot of things outside of the scope of what we’re interested in (such as file formats, permissions, or disk space availability).
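With the data-first `__init__`, a test needs no filesystem at all. A sketch, using plain `assert`s rather than any particular test framework:

```python
class T:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f'T({self.data!r})'

# the test constructs the object directly from in-memory data;
# no temporary files, fixtures, or disk access required
obj = T({'abc': 123, 'def': 456})
assert obj.data['abc'] == 123
assert repr(obj) == "T({'abc': 123, 'def': 456})"
```

Anything that `hypothesis` generates can be passed straight in as the `data` argument.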
Furthermore, if we are given a file that is in a format just very slightly different from the format we anticipate (it is delimited by semicolons instead of commas, for example), do we need to first read in that file, parse it, modify it, write it back out to a temporary file, then pass the temporary file’s filename to our `__init__`?
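With the data-first `__init__`, the variant format is handled entirely outside the class. A sketch, assuming the same `T(data)` class as above and a hypothetical semicolon-delimited input:

```python
from csv import reader
from io import StringIO

class T:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f'T({self.data!r})'

# semicolon-delimited contents (could just as well come from a real file)
contents = StringIO('abc;123\ndef;456\nxyz;789')

# csv.reader's delimiter parameter absorbs the format difference;
# no read–modify–rewrite round-trip through a temporary file
data = {name: int(value) for name, value in reader(contents, delimiter=';')}
obj = T(data)
print(obj)  # T({'abc': 123, 'def': 456, 'xyz': 789})
```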
Consider that one of the most common sources of “churn” in our code (defining “churn” as changes in requirements that lead to disruptive changes in the code) is changes in input and output formats. This kind of “churn” is particularly problematic because it’s likely that we don’t directly control the input or output formats; thus, the requirements change is likely to happen when we’re least prepared to deal with it. If someone else controls the input format, what stops them from changing it on the week when you’re away on vacation, breaking your code and forcing you to come home early?
We can try to avoid this “churn” through social mechanisms by requiring that changes are announced in advance. Unfortunately, external parties aren’t forced to play along. In practice, it is common that these kinds of changes occur when we are least prepared to deal with them.
## Superficialization
Because we cannot eliminate these kinds of requirement changes, we must try to reduce the disruption they cause. We may do this through “superficialization”: making the parts of our code that are most likely to change in a way or on a schedule adverse to us as shallow as possible in the structure of our code. (This shallowness generally supports more rapid, less disruptive modification.)
Requirements may not only change, but they may accumulate. Our `__init__` that read from a file handled the file contents in CSV format. Why would it be unreasonable for our key stakeholders to ask for the `__init__` to support JSON as well?
This will quickly spiral out of control. Our `__init__` is doing too much and is rapidly accumulating modalities. Furthermore, for someone using our data type, it may not be clear which modalities to care about or which modal arguments are related.
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader
from json import load, dump

class T:
    def __init__(self, csv_filename=None, json_filename=None, json_key=None):
        if csv_filename is None and json_filename is None:
            raise ValueError('must pass either CSV or JSON filename')
        if csv_filename is not None and json_filename is not None:
            raise ValueError('must pass either CSV or JSON filename, not both')
        self.data = {}
        if csv_filename is not None:
            with open(csv_filename) as f:
                for name, value in reader(f):
                    self.data[name] = int(value)
        elif json_filename is not None:
            with open(json_filename) as f:
                self.data = load(f)
            if json_key is not None:
                self.data = self.data[json_key]

    def __repr__(self):
        return f'T({self.data!r})'

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        csv_filename = Path(p) / 'data.csv'
        with open(csv_filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        json_filename = Path(p) / 'data.json'
        with open(json_filename, mode='w') as f:
            dump({'abc': 123, 'def': 456, 'xyz': 789}, f)

        obj = T(csv_filename=csv_filename)
        print(f'{obj = }')

        obj = T(json_filename=json_filename)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
Since these modalities are bounded (the input is either a CSV or a JSON file), we can nominally decompose them into individual `classmethod`s:
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader
from json import load, dump

class T:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'T({self.data!r})'

    @classmethod
    def from_csv(cls, filename):
        data = {}
        with open(filename) as f:
            for name, value in reader(f):
                data[name] = int(value)
        return cls(data=data)

    @classmethod
    def from_json(cls, filename, json_key=None):
        with open(filename) as f:
            data = load(f)
        if json_key is not None:
            data = data[json_key]
        return cls(data=data)

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        csv_filename = Path(p) / 'data.csv'
        with open(csv_filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        json_filename = Path(p) / 'data.json'
        with open(json_filename, mode='w') as f:
            dump({'abc': 123, 'def': 456, 'xyz': 789}, f)

        obj = T(data={'abc': 123, 'def': 456, 'xyz': 789})
        print(f'{obj = }')

        obj = T.from_csv(filename=csv_filename)
        print(f'{obj = }')

        obj = T.from_json(filename=json_filename)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
This makes the complexity more linear: instead of having one `__init__` where each additional keyword argument multiplicatively increases our complexity, we have N `classmethod` factory functions, each of which merely adds to our complexity.
The `__init__` above is testable, and is more responsive to requirements changes (since we can introduce additional `classmethod`s as necessary). It also represents the one unique, unambiguous way to initialize this object: by passing in its data.
The interaction with the most likely cause of “churn” is moderately superficial, without being unnecessarily repetitive. We could consider writing a `staticmethod` that performs just the data parsing, allowing someone to intermediate the parsing and the object construction, but this may be overkill. After all, we have other options for decomposing the `classmethod`s, and we have a lot of flexibility to find the right approach over time without unnecessarily disrupting the rest of our code.
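For completeness, a sketch of what that `staticmethod` decomposition might look like (the `parse_csv` name and its interface are hypothetical, not from the original post):

```python
from csv import reader

class T:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'T({self.data!r})'

    @staticmethod
    def parse_csv(lines):
        # pure parsing: takes any iterable of lines, returns plain data;
        # no file I/O, so it can be intermediated or tested directly
        return {name: int(value) for name, value in reader(lines)}

    @classmethod
    def from_csv(cls, filename):
        with open(filename) as f:
            return cls(data=cls.parse_csv(f))

# the parsing step can now be exercised without touching the disk
print(T(T.parse_csv(['abc,123', 'def,456'])))  # T({'abc': 123, 'def': 456})
```

Because `csv.reader` accepts any iterable of lines, the same `parse_csv` serves both the `from_csv` path and in-memory callers.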
## Wrap-Up
And there you have it: not only why one should avoid working with files in their `__init__` methods, but also some solutions to preserve simple interactions for your end-users using `classmethod` constructors. Have some thoughts on this topic? Then let us know by joining us on Discord!