Refactoring Your Global Blobs

We all have to read code that we didn’t write. Sometimes the code we read through is good code, sometimes it is bad code, and sometimes… it is horrendous code. While code quality is a topic that can be tricky to pin down, it can often be assessed by its maintainability (Can the design be extended easily in the future?), transparency (Is it obvious what is going on or being manipulated at any given time?), and/or readability (Does the code target the average reading level for your team?).

Welcome back to Cameron’s Corner!

I recently reviewed some analytical code that had numerous metaparameters. These were defined in a dictionary at the top of the script, and they controlled aspects of the computation. This pattern of global variable usage should be avoided because it hides what each function needs in order to execute. It hinders the maintainability, testability, and readability of our code: modifications can be made at any arbitrary point, so we cannot know the state of the global at any given moment during runtime.

variables = {
    'x': 1,
    'y': ['hello', 'world'],
    'z': 12,
    'mu': 42,
    'category': 'something',
}

def f():
    return ' '.join(variables['y'])

def g():
    if variables['category'] == 'something':
        variables['mu'] /= 3
    return variables['mu']

def h():
    z = variables['z']
    def _h():
        return variables['x'] + z
    return _h()

print(
    f'{f() = }',
    f'{g() = }',
    f'{h() = }',
    f'{variables.pop("mu") = }',
    sep='\n'
)
f() = 'hello world'
g() = 14.0
h() = 13
variables.pop("mu") = 14.0

What is wrong here?

This reliance on global state is problematic because nothing shows or controls how the data is used and modified. Any code that reads or modifies variables does so without checks or logging, so bugs slip in unnoticed. It becomes difficult to track how the program’s state evolves, and the difficulty grows with the size and complexity of the codebase.

Global state also makes the program harder to extend and maintain. As your program grows, the more this global dictionary is accessed, the greater the risk of unintended side effects. If another developer (or even future you) modifies variables in an unexpected way, it can introduce subtle bugs that are hard to trace back to their source.
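To make the side-effect risk concrete, here is a minimal sketch that reuses the shape of g from the example above with a trimmed-down dictionary. Calling the same function twice with no arguments gives two different answers, because the first call quietly rewrote the shared state.

```python
# Same shape as the example above: a module-level dict mutated inside g()
variables = {'mu': 42, 'category': 'something'}

def g():
    if variables['category'] == 'something':
        variables['mu'] /= 3   # hidden write to global state
    return variables['mu']

first = g()    # 42 / 3
second = g()   # 14.0 / 3, because the first call already changed 'mu'
print(first, second)
print(first == second)  # False
```

Nothing in the signature of g hints that its result depends on how many times it has already run.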

Additionally, when you need to refactor the code to improve structure or clarity, the lack of visibility into how this dictionary is used makes the task more error-prone. You’ll need to comb through the entire codebase to find where this global variable is read from or written to, and that’s where using a proxy can really help.

grep + find/replace

While using grep or a basic find/replace could seem like a quick solution, it’s not foolproof. If the variables dictionary is ever assigned to another name or alias, grep will miss those references, leading to incomplete refactors. You also run into the problem of false positives: local variables named similarly to the global dictionary could be incorrectly identified, leading to confusion or mistakes in refactoring.
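Both failure modes can be reproduced in a few lines. The sketch below is an illustration only: the source string, the alias name cfg, and the lookalike my_variables are all hypothetical, and a regular expression stands in for what a grep for `variables[` would roughly report.

```python
import re

# Hypothetical source text to search, standing in for a file on disk
source = '''
variables = {'mu': 42}
cfg = variables            # alias: same dict object, different name
cfg['mu'] = 0              # a write through the alias
my_variables = {'mu': 1}   # unrelated dict with a similar name
my_variables['mu'] = 2
'''

# Roughly what a text search for "variables[" would report
hits = [line for line in source.splitlines() if re.search(r'variables\[', line)]
print(hits)  # only the unrelated my_variables line

# Executing the snippet shows that the alias write did reach the global
namespace = {}
exec(source, namespace)
print(namespace['variables'])  # {'mu': 0}
```

The search reports one false positive and misses the only line that actually changed the global.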

Another major issue is that grep doesn’t capture the broader context in which the variable is used. This makes it impossible to see how the data flows through the program. So while grep might be helpful in small, straightforward scripts, it’s not robust enough for complex projects with multiple layers of abstraction.

Proxying Observability

A proxy pattern is an excellent first step toward gaining more control over global state. By using a proxy object to wrap our variables dictionary, we can intercept every access or update to the dictionary, making it much easier to trace how it is used throughout the code. This helps to improve observability and ensures that you understand how and when the global state is changing.

With this proxy approach, we define the __getitem__ and __setitem__ methods to log whenever a key is accessed or modified. This allows you to capture detailed information about the location in the code where the change occurred by logging the calling frame using the inspect module. The logframe function we created provides a detailed trace of where the access or mutation happened, including the filename, function name, and line number. This level of detail gives you insights into how the program behaves at runtime, helping you plan a more targeted refactor.

By using collections.UserDict, we get a proxy that behaves like a normal dictionary, but with added functionality for logging. UserDict matters here because its inherited methods, such as update and pop, are written in terms of __getitem__, __setitem__, and __delitem__, so our overrides see those calls too. This makes it an ideal tool for tracing global state usage without having to change the rest of the program. We also take advantage of Python’s dynamic nature by adding logging only where needed, without interfering with the underlying business logic of the code.
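The choice of UserDict over a plain dict subclass is easy to check. In the sketch below (the class names and the calls list are made up for illustration, and the behavior shown is CPython's), only the UserDict subclass sees the write performed by update.

```python
from collections import UserDict

calls = []  # records which class saw a __setitem__ call

class LoggedDict(dict):
    def __setitem__(self, key, value):
        calls.append(('dict', key))
        super().__setitem__(key, value)

class LoggedUserDict(UserDict):
    def __setitem__(self, key, value):
        calls.append(('UserDict', key))
        super().__setitem__(key, value)

LoggedDict().update(x=1)       # dict.update writes directly, bypassing the override
LoggedUserDict().update(x=1)   # MutableMapping.update goes through self[key] = ...
print(calls)  # [('UserDict', 'x')]
```

A proxy built on dict would therefore stay silent for every write made through update, which is exactly the kind of blind spot we are trying to remove.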

This approach gives us more than a simple grep search ever could—now we can see every time our dictionary is manipulated and, more importantly, by whom and where. This insight is invaluable for refactoring and improving the maintainability of the code.

from collections import UserDict
from inspect import getouterframes, currentframe
from itertools import chain
from io import StringIO
from textwrap import indent

def logframe(frame, buffer=None):
    if buffer is None:
        buffer = StringIO()

    template = '{f.filename}:{f.function}:{f.lineno} → {code_context}'.format
    # getouterframes lists the innermost frame first; reverse it so the
    # trace reads from the outermost call down to the access itself
    frame_iter = reversed(getouterframes(frame))

    # Skip past async code dispatch lines (specific to Jupyter, whose cells
    # live under /tmp). If no frame matches, only the innermost frame is logged.
    for f in frame_iter:
        if f.filename.startswith('/tmp'):
            break

    # Log information to buffer
    for f in chain([f], frame_iter):
        # code_context is None when the source line is unavailable
        buffer.write(template(f=f, code_context=''.join(f.code_context or []).strip()))
        buffer.write('\n')
    return buffer.getvalue()

class D(UserDict):
    def __getitem__(self, key):
        print(
            f'__getitem__ {key!r}',
            indent(logframe(currentframe().f_back), prefix=' '*2),
            sep='\n'
        )
        return self.__dict__['data'][key]

    def __setitem__(self, key, value):
        print(
            f'__setitem__ {key!r}',
            indent(logframe(currentframe().f_back), prefix=' '*2),
            sep='\n'
        )
        self.__dict__['data'][key] = value

variables = {
    'x': 1,
    'y': ['hello', 'world'],
    'z': 12,
    'mu': 42,
    'category': 'something',
}
variables = D(variables)

print(
    f'{f() = }',
    f'{g() = }',
    f'{h() = }',
    f'{variables.pop("mu") = }',
    sep='\n'
)
__setitem__ 'x'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 50>:50 → variables = D(variables)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py:__init__:1090 → self.update(dict)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:update:991 → self[key] = other[key]

__setitem__ 'y'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 50>:50 → variables = D(variables)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py:__init__:1090 → self.update(dict)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:update:991 → self[key] = other[key]

__setitem__ 'z'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 50>:50 → variables = D(variables)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py:__init__:1090 → self.update(dict)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:update:991 → self[key] = other[key]

__setitem__ 'mu'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 50>:50 → variables = D(variables)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py:__init__:1090 → self.update(dict)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:update:991 → self[key] = other[key]

__setitem__ 'category'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 50>:50 → variables = D(variables)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/collections/__init__.py:__init__:1090 → self.update(dict)
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:update:991 → self[key] = other[key]

__getitem__ 'y'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:53 → f'{f() = }',
  /tmp/ipykernel_111115/1167105545.py:f:10 → return ' '.join(variables['y'])

__getitem__ 'category'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:54 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:13 → if variables['category'] == 'something':

__getitem__ 'mu'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:54 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:14 → variables['mu'] /= 3

__setitem__ 'mu'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:54 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:14 → variables['mu'] /= 3

__getitem__ 'mu'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:54 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:15 → return variables['mu']

__getitem__ 'z'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:55 → f'{h() = }',
  /tmp/ipykernel_111115/1167105545.py:h:18 → z = variables['z']

__getitem__ 'x'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:55 → f'{h() = }',
  /tmp/ipykernel_111115/1167105545.py:h:21 → return _h()
  /tmp/ipykernel_111115/1167105545.py:_h:20 → return variables['x'] + z

__getitem__ 'mu'
  /tmp/ipykernel_111115/2129497643.py:<cell line: 52>:56 → f'{variables.pop("mu") = }',
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:pop:954 → value = self[key]

f() = 'hello world'
g() = 14.0
h() = 13
variables.pop("mu") = 14.0

One potential downside is that the proxy will log every single access or update, including the initial setup of the dictionary, which can add unnecessary noise. To mitigate this, we can add a flag to control when logging occurs. By turning off logging during initialization, we reduce clutter and focus only on runtime manipulations that matter for debugging and refactoring purposes.

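With this improvement, the proxy stays silent while it is being populated and starts logging once construction finishes. Because log is a plain attribute, it can also be set to False by hand to silence the proxy later.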

from collections import UserDict
from inspect import currentframe
from textwrap import indent

# logframe is reused from the previous example

class D(UserDict):
    def __init__(self, *args, **kwargs):
        self.log = False
        super().__init__(*args, **kwargs)
        self.log = True

    def __getitem__(self, key):
        if self.log:
            print(
                f'__getitem__ {key!r}',
                indent(logframe(currentframe().f_back), prefix=' '*2),
                sep='\n'
            )
        return self.__dict__['data'][key]
    
    def __setitem__(self, key, value):
        if self.log:
            print(
                f'__setitem__ {key!r}',
                indent(logframe(currentframe().f_back), prefix=' '*2),
                sep='\n'
            )
        self.__dict__['data'][key] = value

variables = {
    'x': 1,
    'y': ['hello', 'world'],
    'z': 12,
    'mu': 42,
    'category': 'something',
}
variables = D(variables)

print(
    f'{f() = }',
    f'{g() = }',
    f'{h() = }',
    f'{variables.pop("mu") = }',
    sep='\n'
)
__getitem__ 'y'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:38 → f'{f() = }',
  /tmp/ipykernel_111115/1167105545.py:f:10 → return ' '.join(variables['y'])

__getitem__ 'category'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:39 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:13 → if variables['category'] == 'something':

__getitem__ 'mu'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:39 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:14 → variables['mu'] /= 3

__setitem__ 'mu'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:39 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:14 → variables['mu'] /= 3

__getitem__ 'mu'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:39 → f'{g() = }',
  /tmp/ipykernel_111115/1167105545.py:g:15 → return variables['mu']

__getitem__ 'z'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:40 → f'{h() = }',
  /tmp/ipykernel_111115/1167105545.py:h:18 → z = variables['z']

__getitem__ 'x'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:40 → f'{h() = }',
  /tmp/ipykernel_111115/1167105545.py:h:21 → return _h()
  /tmp/ipykernel_111115/1167105545.py:_h:20 → return variables['x'] + z

__getitem__ 'mu'
  /tmp/ipykernel_111115/3234464708.py:<cell line: 37>:41 → f'{variables.pop("mu") = }',
  /home/cameron/.pyenv/versions/3.10.2/lib/python3.10/_collections_abc.py:pop:954 → value = self[key]

f() = 'hello world'
g() = 14.0
h() = 13
variables.pop("mu") = 14.0
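One gap is visible in the trace: variables.pop("mu") shows up only as a __getitem__, even though it also removes the key, because the class does not override __delitem__. The following is a minimal sketch of closing that gap, not the class from this post. To stay self-contained it records events in a list (the events attribute and the _record helper are hypothetical) instead of printing frames with logframe.

```python
from collections import UserDict

class D(UserDict):
    """Sketch: record reads, writes, and deletions instead of printing them."""

    def __init__(self, *args, **kwargs):
        self.log = False
        self.events = []   # hypothetical attribute, not in the original class
        super().__init__(*args, **kwargs)
        self.log = True

    def _record(self, op, key):
        if self.log:
            self.events.append((op, key))

    def __getitem__(self, key):
        self._record('__getitem__', key)
        return self.data[key]

    def __setitem__(self, key, value):
        self._record('__setitem__', key)
        self.data[key] = value

    def __delitem__(self, key):
        self._record('__delitem__', key)
        del self.data[key]

variables = D({'mu': 42})
variables.pop('mu')
print(variables.events)  # [('__getitem__', 'mu'), ('__delitem__', 'mu')]
```

The same override would also catch del variables[key] and popitem, since both go through __delitem__.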

Wrap-Up

Refactoring global variables—especially in complex codebases—can be a daunting task. While simple tools like grep can provide basic search functionality, they lack the depth needed to fully understand how and where global state is being manipulated. Proxying the global dictionary offers a more powerful approach, giving you detailed insight into how your program interacts with its global state. By leveraging Python’s object model and inspect module, we can log every access or modification, making the refactoring process more transparent and manageable.

This method provides more than just observability—it highlights potential areas for improvement, reveals hidden dependencies, and helps prevent future bugs by offering a clearer understanding of your code’s behavior. As a first step toward a cleaner, more maintainable system, proxying is a low-effort, high-reward technique that can transform how you approach global state in your code.
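As an illustration of where the trace can lead, here is one possible refactor of f, g, and h, offered as a sketch rather than the definitive answer. The trace showed f reading 'y', g reading 'category' and both reading and writing 'mu', and h reading 'x' and 'z', so each function can take exactly those values as parameters. Note the assumed behavior change: this g returns the adjusted value instead of writing it back, so any caller that relied on the mutation would need to store the result.

```python
# Each signature now states what the trace showed the function reading
def f(y):
    return ' '.join(y)

def g(mu, category):
    # Returns the adjusted value instead of writing it back to shared state
    return mu / 3 if category == 'something' else mu

def h(x, z):
    return x + z

variables = {'x': 1, 'y': ['hello', 'world'], 'z': 12, 'mu': 42, 'category': 'something'}

print(f(variables['y']))                           # hello world
print(g(variables['mu'], variables['category']))   # 14.0
print(h(variables['x'], variables['z']))           # 13
print(variables['mu'])                             # 42, unchanged
```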

By using these tools and techniques, you make your codebase more robust, maintainable, and easier to extend in the long run—ultimately transforming the dreaded global state into a more manageable and observable part of your program.

What do you think about my approach? Do you agree with me? What do you believe makes code good or bad? Let me know on the DUTC Discord server.

Talk to you all next week!