Table of Contents

Is Your Code Too DRY?

Principles like "Don't Repeat Yourself" (DRY) are shared widely across programming communities, and they can be useful touchstones. But what happens when nuance is neglected in more complicated situations? Can principles like DRY still be effective?

While adages like this offer valuable wisdom, they can often oversimplify the unique context of each situation. Worse than making a mistake due to rigid adherence to an acronym, we risk losing the ability to think critically about our code and make informed decisions without relying on these rules.

An Adage for Every Occasion: Wisdom in a Few Words

Imagine you’re hosting a big family dinner. The clock is ticking, and the kitchen is chaotic. You rush through chopping vegetables, skip steps in the recipe, and ignore the timing of the dishes. When it’s time to serve, the vegetables are unevenly cooked, and the roast is burnt on the outside but raw on the inside. This is a perfect example of how "haste makes waste." Rushing through a task might seem like a way to save time, but it often leads to mistakes that take more time to fix. Taking your time, following each step carefully, and paying attention to the details would have made the whole process smoother, and the meal much better.

But there’s also the saying, "He who hesitates is lost," which suggests that sometimes quick decisions are necessary. If you hesitate to get started, the dinner might have been delayed even further. Sometimes, speed is important—delaying action can lead to missed opportunities or results that aren’t up to par. So of course, the key is finding the right balance: knowing when to move quickly and when to slow down, and understanding the situation at hand so you can make the best decision.

Contradictory adages like these crop up all the time, and in my everyday life, I make decisions somewhere in the middle of such guidance, falling back on a justifiable, reasonable choice.

So, how can one find such a balance? Let's look at some tools I employ to make these trade-offs deliberately.

Premature Abstraction

Consider the following script, in which three processes each load some data, square and sum the values, and report the result:

# Process A
a_data = [1, 2, 3, 4]                  # ① load
a_transform = [x ** 2 for x in a_data] # ② process
a_result = sum(a_transform)
print(f'{a_result = }')                # ③ report


# Process B
b_data = [5, 6, 0, -1]                 # ① load
b_transform = [x ** 2 for x in b_data] # ② process
b_result = sum(b_transform)
print(f'{b_result = }')                # ③ report

# Process C
c_data = [-10, 8, 4, 0]                # ① load
c_transform = [x ** 2 for x in c_data] # ② process
c_result = sum(c_transform)
print(f'{c_result = }')                # ③ report
a_result = 30
b_result = 62
c_result = 180

Let's remove the repetition in this code. After reading through it, one should find two independent points of repetition across these examples:

  1. The program structure: load → process → report

  2. The transformation: summing the squared values

Both of these are repetitions that exist in our code. Strictly following the DRY principle, we might refactor to something akin to the following:

def sum_of_squares(values):
    return sum(x ** 2 for x in values)

# ① load
datasets = {
    'a': [1, 2, 3, 4],
    'b': [5, 6, 0, -1],
    'c': [-10, 8, 4, 0]
}

# ② process
results = {label: sum_of_squares(values) for label, values in datasets.items()}

# ③ report
for label, result in results.items():
    print(f'{label}_result → {result:>3}')
a_result →  30
b_result →  62
c_result → 180

Comparing this against the repetitions we identified earlier:

  1. The program structure: load → process → report

We structure the data into a dictionary after/as part of loading, enabling us to perform uniform actions such as processing and reporting.

  2. The transformation: summing the squared values

We factor out the transformation into its own function that we can reuse.

At first glance, this seems perfect—but only for a world where all datasets are processed the same way.

Hitting a Moving Target

What if we're informed that we're processing the values in dataset 'b' and 'c' incorrectly? Instead of taking the sum of squares, we need to take the sum of cubes. As such, we need to either undo some of the refactoring we've already done, or do something much worse: apply a dirty patch (a less-than-ideal quick fix).

def process_data(label, values):
    if label.casefold() in ('b', 'c'):
        return sum(x ** 3 for x in values)
    return sum(x ** 2 for x in values)

datasets = {
    'a': [1, 2, 3, 4],
    'b': [5, 6, 0, -1],
    'c': [-10, 8, 4, 0]
}
results = {
    label: process_data(label, values) for label, values in datasets.items()
}
for label, result in results.items():
    print(f'{label}_result → {result:>5}')
a_result →    30
b_result →   340
c_result →  -424

The code we wrote solves the problem at hand, but it damages the extensibility of the broader program.

Why? Well, we have coupled process_data to the specific dataset labels, giving it an ambiguous purpose.

At this point, tucking this logic inside a function decreases the code's readability: process_data becomes a black box. Inside that black box, flow-control mechanics dictate how the data will be processed. And because the function now encodes both how and when to apply certain logic, future readers must mentally decode that logic just to see what's happening.

To improve readability, we can either bring the control flow up to the top-level scripting layer or factor out the need for flow control entirely. Once we do this, it becomes obvious that we don't actually want a single process_data function; instead, we want separate functions that compose all of the different ways we intend to process data. In the example below, I opted to replace the flow-control element with a dictionary. This lets me pair specific datasets with specific modes of processing, keeping the relationship declarative and at the most superficial level of the script, where it is most obvious to readers.

from collections import defaultdict

def sum_of_cubes(values):
    return sum(x ** 3 for x in values)

def sum_of_squares(values):
    return sum(x ** 2 for x in values)

datasets = {
    'a': [1, 2, 3, 4],
    'b': [5, 6, 0, -1],
    'c': [-10, 8, 4, 0]
}
processors = defaultdict(
    lambda: sum_of_squares, # default (in this example, just 'a')
    {
        'b': sum_of_cubes,  # special case 'b'
        'c': sum_of_cubes,  # special case 'c'
    }
)

results = {
    label: processors[label](values) for label, values in datasets.items()
}
for label, result in results.items():
    print(f'{label}_result → {result:>4}')
a_result →   30
b_result →  340
c_result → -424
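For comparison, here is a minimal sketch of the other option mentioned above: keeping the special-casing as control flow at the top level of the script, rather than behind a dispatch table (same data and helper functions as before).

```python
def sum_of_cubes(values):
    return sum(x ** 3 for x in values)

def sum_of_squares(values):
    return sum(x ** 2 for x in values)

datasets = {
    'a': [1, 2, 3, 4],
    'b': [5, 6, 0, -1],
    'c': [-10, 8, 4, 0],
}

results = {}
for label, values in datasets.items():
    # the special cases are visible right here, at the top level
    if label in ('b', 'c'):
        results[label] = sum_of_cubes(values)
    else:
        results[label] = sum_of_squares(values)

for label, result in results.items():
    print(f'{label}_result → {result:>4}')
```

This keeps the how/when decision in plain sight, at the cost of growing an if/elif chain as special cases accumulate; the dictionary approach above scales more gracefully.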

Wrap-Up

The DRY principle is practical, but it's not infallible. Refactoring too early or too aggressively can make your code harder to extend, reason through, or debug, especially when differences that seem trivial today become important tomorrow. Sometimes, a little repetition keeps your intent clear and your code adaptable. When you want to eliminate repetition, remember to ask yourself: Does this code exhibit intentional or coincidental repetition? Will the refactored code be extensible in this situation?
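To illustrate that first question, here is a hypothetical pair of helpers (shipping_weight and billing_weight are invented names for this sketch) that start out textually identical but are only coincidentally repeated:

```python
import math

# Today, both computations happen to look identical...
def shipping_weight(items):
    return sum(item['weight'] for item in items)

def billing_weight(items):
    return sum(item['weight'] for item in items)

# ...but tomorrow, billing is told to round each item up to the next whole
# unit, while shipping stays as-is. Had we merged the two under DRY, we
# would now be un-merging them (or reaching for a dirty patch).
def billing_weight(items):
    return sum(math.ceil(item['weight']) for item in items)

items = [{'weight': 1.5}, {'weight': 2.25}]
print(shipping_weight(items))  # 3.75
print(billing_weight(items))   # 2 + 3 = 5
```

The repetition here was coincidental: the two functions served different purposes that merely happened to share a formula for a while.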

If you can answer those two questions, then you have better insight as to when you should eliminate repetition than the “Don't Repeat Yourself” adage would ever provide.

What do you think about nuance and repetition? Let me know your thoughts on the DUTC Discord server.
