# Working With Files Deep in Your Code
Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics for experts. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
As you may already know, we frequently train corporate teams on topics such as introduction to Python, advanced Python, API design, data analysis, and much more! Our trainings always involve a custom curriculum, which we tailor to the needs of the team and balance with the expectations of management.
This week, we are holding a course on Python for scripting and automation. At the end of the second day of class, an attendee asked us a question that we’ve gotten before but never fully written up. The attendee asked: “Why shouldn’t I read from a file in my object’s `__init__`?”
In response to this, we had James Powell write up a very thoughtful answer, which he shared on Discord, and I wanted to share it with our blog readers and newsletter subscribers as well. So, here is his response:
It is generally a bad idea to open a file within a class’s `__init__`.
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader

class T:
    def __init__(self, filename):
        self.data = {}
        with open(filename) as f:
            for name, value in reader(f):
                self.data[name] = int(value)

    def __repr__(self):
        return f'T({self.data!r})'

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        filename = Path(p) / 'data.csv'
        with open(filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        obj = T(filename)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
## Premise: Data Model Methods
There are multiple reasons for this. First, our primary motivation for using “data model” methods is to lean into a “common vocabulary” shared among Python programmers.
For example, if we have a collection type, then it’s likely the user will want to know its (non-negative integer) “size.” In the Python vocabulary, this is called `len(obj)` and is implemented by `__len__`. Every collection type in the builtins supports `__len__`, so `len(obj)` works consistently irrespective of whether we have a `list`, a `tuple`, a `dict`, a `set`, a `frozenset`, or a `str`.
If we were to implement our own collection type, say, something similar to a `numpy.ndarray`, `pandas.Series`, or `pandas.DataFrame`, someone would definitely want to know how big these structures are, and their first intuition won’t be to go to the documentation to figure out whether this is an attribute or a method or whether this is called `.size` or `.length` or `.num_elements`. Instead, they’ll write `len(obj)` and expect this to work along exactly the same lines as it would on a `list`.
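To make this “common vocabulary” concrete, here is a minimal sketch of a custom collection that participates in it (the `Bag` class is hypothetical, purely for illustration):

```python
class Bag:
    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        # len(obj) delegates to __len__; it must return a non-negative int
        return len(self.items)

bag = Bag(['a', 'b', 'c'])
print(len(bag))  # 3 — the same vocabulary as list, tuple, dict, …
```

A user of `Bag` never needs to learn a bespoke `.size`/`.length` attribute; `len` just works.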
However, since a `numpy.ndarray`, a `pandas.Series`, and a `pandas.DataFrame` are materially distinct from `list`, the implementation of `__len__` and its exact behavior will differ. There are conventions and rules underlying the behavior of protocols like `__len__` (some of which are not documented and instead are discovered through interaction with the builtins or the Python standard library).
It’s fair that the implementation of these protocols can assume a foundational understanding of the types on which they are implemented. For example, `len(df)` on a `pandas.DataFrame` gives the number of rows, which is (as a consequence of the design of pandas) the most likely question someone would want answered when asking, “How big is this object?”
```python
from pandas import DataFrame

df = DataFrame({
    'x': [1, 2, 3],
    'y': [4, 5, 6],
})

print(
    f'{len(df) = }',          # number of rows
    f'{len(df.index) = }',    # number of rows
    f'{len(df.columns) = }',  # number of columns
    f'{df.shape = }',         # number of (rows, columns)
    f'{df.size = }',          # number of total values (rows×columns)
    sep='\n',
)
```

```
len(df) = 3
len(df.index) = 3
len(df.columns) = 2
df.shape = (3, 2)
df.size = 6
```
In fact, when implementing the “vocabulary” on a custom type, the clarity of the meaning is critical. In the case of a `pandas.DataFrame`, there are at least four possible answers one could give to “How big is this object?”:

- the number of rows
- the number of columns
- the number of (rows, columns)
- the total number of values (rows×columns)
If we had to guess the answer every time we saw `len(df)`, we would not only gain nothing by implementing this “vocabulary,” but we’d make our code less readable and understandable.
If we want to implement methods from the object model so that they are unambiguous, then we’re probably looking for either…

- a unique meaning (e.g., `len([1, 2, 3])` has only one possible meaningful answer)
- a constrained meaning (e.g., `len(obj)` must return a non-negative integer smaller than `sys.maxsize`; therefore, its range of possible meanings is constrained to only interpretations that satisfy these rules)
- a privileged meaning (e.g., `len(DataFrame(...))` shows the “privileged” nature of the rows of the `pandas.DataFrame`)
- a conventional meaning (e.g., `len(obj)` on a very commonly-used data type from a popular library may, over time, develop a meaning that is widely known and widely accepted by its community of users)
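The “constrained meaning” point isn’t merely advisory: CPython actually enforces that `__len__` return a non-negative integer. A quick sketch (the exact error message may vary across Python versions):

```python
class Broken:
    def __len__(self):
        # len() requires a non-negative integer; CPython rejects -1
        return -1

try:
    len(Broken())
except ValueError as e:
    print(f'{e = }')
```

Returning a non-integer from `__len__` is similarly rejected, with a `TypeError`.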
## `__init__` as an unambiguous protocol
Consider, then, that `__init__` represents the protocol by which we initialize an object. While the meaning of `__init__` may be obvious, it raises follow-up questions: initialize this object… from what? In other words, what arguments do I have to pass in?
If our object is designed well, then the answer to these questions should be immediately obvious. An `__init__` that allows loading from a filename is likely to be ambiguous because there is usually at least one other option for how to initialize a (largely stateless, primarily data) object: by passing in its constituent data values.
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader

class T:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'T({self.data!r})'

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        filename = Path(p) / 'data.csv'
        with open(filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        data = {}
        with open(filename) as f:
            for name, value in reader(f):
                data[name] = int(value)

        obj = T(data)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
The `__init__` here doesn’t do anything interesting. (In fact, we may be encouraged to use a `dataclasses.dataclass` to avoid this boilerplate!)
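For instance, a minimal sketch of the same class as a `dataclasses.dataclass`, assuming we’re happy to trade the hand-written `__repr__` for the generated one:

```python
from dataclasses import dataclass, field

@dataclass
class T:
    # the generated __init__ simply assigns the argument to self.data
    data: dict = field(default_factory=dict)

obj = T({'abc': 123, 'def': 456, 'xyz': 789})
print(obj)  # T(data={'abc': 123, 'def': 456, 'xyz': 789})
```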
However, despite it being “boring,” it’s important that we provide this way to initialize this object; otherwise, our code will become very clumsy.
For example, if we need to write a test for this object, do we want to have to create a file on disk for that test to run? If something like `hypothesis` is going to parameterize the test, do we want to have some fixture that creates a file and then have our `__init__` read from that file? This is very clumsy and is likely to test a lot of things outside of the scope of what we’re interested in (such as file formats, permissions, or disk space availability).
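With the data-first `__init__`, a test needs no filesystem at all. A sketch, using plain `assert`s rather than any particular test framework:

```python
class T:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f'T({self.data!r})'

# the test constructs the object directly from in-memory data;
# no temporary files, fixtures, or disk access required
obj = T({'abc': 123, 'def': 456})
assert obj.data['abc'] == 123
assert repr(obj) == "T({'abc': 123, 'def': 456})"
```

Anything that `hypothesis` generates can be passed straight in as the `data` argument.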
Furthermore, if we are given a file that is in a format just very slightly different from the format we anticipate (it is delimited by semicolons instead of commas, for example), do we need to first read in that file, parse it, modify it, write it back out to a temporary file, then pass the temporary file’s filename to our `__init__`?
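With the data-first `__init__`, the variant format is handled entirely outside the class. A sketch, assuming the same `T(data)` class as above and a hypothetical semicolon-delimited input:

```python
from csv import reader
from io import StringIO

class T:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f'T({self.data!r})'

# semicolon-delimited contents (could just as well come from a real file)
contents = StringIO('abc;123\ndef;456\nxyz;789')

# csv.reader's delimiter parameter absorbs the format difference;
# no read–modify–rewrite round-trip through a temporary file
data = {name: int(value) for name, value in reader(contents, delimiter=';')}
obj = T(data)
print(obj)  # T({'abc': 123, 'def': 456, 'xyz': 789})
```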
Consider that one of the most common sources of “churn” in our code (defining “churn” as changes in requirements that lead to disruptive changes in the code) is changes in input and output formats. This kind of “churn” is particularly problematic because it’s likely that we don’t directly control the input or output formats; thus, the requirements change is likely to happen when we’re least prepared to deal with it. If someone else controls the input format, what stops them from changing it on the week when you’re away on vacation, breaking your code and forcing you to come home early?
We can try to avoid this “churn” through social mechanisms by requiring that changes are announced in advance. Unfortunately, external parties aren’t forced to play along. In practice, it is common that these kinds of changes occur when we are least prepared to deal with them.
## Superficialization
Because we cannot eliminate these kinds of requirement changes, we must try to reduce the disruption they cause. We may do this through “superficialization”: making the parts of our code that are most likely to change in a way or on a schedule adverse to us as shallow as possible in the structure of our code. (This shallowness generally supports more rapid, less disruptive modification.)
Requirements may not only change, but they may accumulate. Our `__init__` that read from a file handled the file contents in CSV format. Why would it be unreasonable for our key stakeholders to ask for the `__init__` to support JSON as well?
This will quickly spiral out of control. Our `__init__` is doing too much and is rapidly accumulating modalities. Furthermore, for someone using our data type, it may not be clear which modalities to care about or which modal arguments are related.
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader
from json import load, dump

class T:
    def __init__(self, csv_filename=None, json_filename=None, json_key=None):
        if csv_filename is None and json_filename is None:
            raise ValueError('must pass either CSV or JSON filename')
        if csv_filename is not None and json_filename is not None:
            raise ValueError('must pass either CSV or JSON filename, not both')
        self.data = {}
        if csv_filename is not None:
            with open(csv_filename) as f:
                for name, value in reader(f):
                    self.data[name] = int(value)
        elif json_filename is not None:
            with open(json_filename) as f:
                self.data = load(f)
            if json_key is not None:
                self.data = self.data[json_key]

    def __repr__(self):
        return f'T({self.data!r})'

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        csv_filename = Path(p) / 'data.csv'
        with open(csv_filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        json_filename = Path(p) / 'data.json'
        with open(json_filename, mode='w') as f:
            dump({'abc': 123, 'def': 456, 'xyz': 789}, f)

        obj = T(csv_filename=csv_filename)
        print(f'{obj = }')

        obj = T(json_filename=json_filename)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
Since these modalities are bounded (the input is either a CSV or a JSON file), we can nominally decompose them into individual `classmethod`s:
```python
from tempfile import TemporaryDirectory
from pathlib import Path
from textwrap import dedent
from csv import reader
from json import load, dump

class T:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'T({self.data!r})'

    @classmethod
    def from_csv(cls, filename):
        data = {}
        with open(filename) as f:
            for name, value in reader(f):
                data[name] = int(value)
        return cls(data=data)

    @classmethod
    def from_json(cls, filename, json_key=None):
        with open(filename) as f:
            data = load(f)
        if json_key is not None:
            data = data[json_key]
        return cls(data=data)

if __name__ == '__main__':
    with TemporaryDirectory() as p:
        csv_filename = Path(p) / 'data.csv'
        with open(csv_filename, mode='w') as f:
            print(dedent('''
                abc,123
                def,456
                xyz,789
            ''').strip(), file=f)

        json_filename = Path(p) / 'data.json'
        with open(json_filename, mode='w') as f:
            dump({'abc': 123, 'def': 456, 'xyz': 789}, f)

        obj = T(data={'abc': 123, 'def': 456, 'xyz': 789})
        print(f'{obj = }')

        obj = T.from_csv(filename=csv_filename)
        print(f'{obj = }')

        obj = T.from_json(filename=json_filename)
        print(f'{obj = }')
```

```
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
obj = T({'abc': 123, 'def': 456, 'xyz': 789})
```
This makes the complexity more linear: instead of having one `__init__` where each additional keyword argument multiplicatively increases our complexity, we have N `classmethod` factory functions, each of which merely adds to our complexity.
The `__init__` above is testable, and is more responsive to requirements changes (since we can introduce additional `classmethod`s as necessary). It also represents the one unique, unambiguous way to initialize this object: by passing in its data.
The interaction with the most likely cause of “churn” is moderately superficial, without being unnecessarily repetitive. We could consider writing a `staticmethod` that performs just the data parsing, allowing someone to intermediate the parsing and the object construction, but this may be overkill. After all, we have other options for decomposing the `classmethod`s, and we have a lot of flexibility to find the right approach over time without unnecessarily disrupting the rest of our code.
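For completeness, a sketch of what that `staticmethod` decomposition might look like (the `parse_csv` name and its interface are hypothetical, not from the original post):

```python
from csv import reader

class T:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f'T({self.data!r})'

    @staticmethod
    def parse_csv(lines):
        # pure parsing: takes any iterable of lines, returns plain data;
        # no file I/O, so it can be intermediated or tested directly
        return {name: int(value) for name, value in reader(lines)}

    @classmethod
    def from_csv(cls, filename):
        with open(filename) as f:
            return cls(data=cls.parse_csv(f))

# the parsing step can now be exercised without touching the disk
print(T(T.parse_csv(['abc,123', 'def,456'])))  # T({'abc': 123, 'def': 456})
```

Because `csv.reader` accepts any iterable of lines, the same `parse_csv` serves both the `from_csv` path and in-memory callers.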
## Wrap-Up
And there you have it: not only why one should avoid working with files in their `__init__` methods, but also some solutions to preserve simple interactions for your end-users using `classmethod` constructors. Have some thoughts on this topic? Then let us know by joining us on Discord!