Parsing with Argparse

Hello, everyone! Welcome back to Cameron's Corner. This week, I want to share some recent analytical work I've been doing as part of a larger project.

A few weeks ago, I was parsing some log files that recorded shell commands and their outputs. Each entry looked something like this:

CMD: python analyze.py --input data1.csv --threshold 0.5 --verbose
OUT: Starting analysis...
OUT: Found 1347 items
OUT: Done in 3.2s

Each CMD: line was the exact command that was run, followed by a bunch of OUT: lines with whatever the program printed.

The goal was to collect results across hundreds of runs (for example, to see how different thresholds affected runtime, or which inputs failed), but first I needed to extract the argument values from the CMD: lines.

Initially, I started hacking something together with a few regular expressions. But the moment I hit a command with quoted arguments, flags without values, or positional arguments mixed in, it became a mess. Then it hit me: Python already has a perfectly good command-line parser. If I could reconstruct an argparse setup similar to the one those scripts used, I could reuse it to decode the log lines safely and consistently.
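To make the quoting problem concrete, here is a small sketch comparing a naive whitespace split against shlex.split. The command string is a made-up example (the quoted file name "my data.csv" does not come from the real logs), and the variable names are only for illustration.

```python
from shlex import split as sh_split

# Hypothetical argument string with a quoted value that contains a space
command_args = '--input "my data.csv" --threshold 0.5 --verbose'

# Splitting on whitespace tears the quoted file name into two tokens
naive_tokens = command_args.split()

# shlex.split follows shell quoting rules and keeps the value intact
shell_tokens = sh_split(command_args)

print(naive_tokens)
print(shell_tokens)
```

The naive version leaves stray quote characters attached to two separate tokens, which is exactly the kind of edge case that tends to pile up in a regex-based approach.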

Quick Prototype

When working on ad-hoc scripts, I find it best to start small. It’s much easier to reason about a single example in isolation than to dive straight into the full dataset. In this case, I just wanted to confirm that argparse could parse the same argument strings that appeared in the logs.

from argparse import ArgumentParser
from pathlib import Path
from shlex import split as sh_split

parser = ArgumentParser()
parser.add_argument('--input', type=Path)
parser.add_argument('--threshold', type=float)
parser.add_argument('--verbose', action='store_true')

command_args = '--input data1.csv --threshold 0.5 --verbose'
parser.parse_args(sh_split(command_args))
Namespace(input=PosixPath('data1.csv'), threshold=0.5, verbose=True)

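Once that worked, I was confident the approach would hold for the combinations of flags and values in the logs, as long as the parser mirrors the arguments the original script defines. argparse converts the types and handles the boolean flag, while shlex.split takes care of shell-style quoting before argparse ever sees the tokens.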
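One caveat when pointing argparse at old logs: by default, parse_args prints a usage message and exits the interpreter when it meets input it cannot handle, which is not ideal in the middle of a batch job. A possible workaround, sketched below, is to subclass ArgumentParser and override its error method so that bad lines raise an exception instead. The names RaisingArgumentParser, safe_parser, and try_parse are my own inventions for this sketch, not part of argparse.

```python
from argparse import ArgumentParser
from pathlib import Path
from shlex import split as sh_split


class RaisingArgumentParser(ArgumentParser):
    """Hypothetical helper: raise instead of exiting on bad input."""

    def error(self, message):
        # The default implementation prints usage and exits with status 2
        raise ValueError(message)


safe_parser = RaisingArgumentParser()
safe_parser.add_argument('--input', type=Path)
safe_parser.add_argument('--threshold', type=float)
safe_parser.add_argument('--verbose', action='store_true')


def try_parse(argument_string):
    """Return a Namespace, or None if the line does not fit the parser."""
    try:
        # shlex.split also raises ValueError, e.g. on an unclosed quote
        return safe_parser.parse_args(sh_split(argument_string))
    except ValueError:
        return None


good = try_parse('--input data1.csv --threshold 0.5 --verbose')
bad = try_parse('--input data1.csv --threshold high')
print(good, bad)
```

Returning None keeps the sketch short; in a real script it may be more useful to log the offending line so that mismatches between the reconstructed parser and the original script are easy to spot.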

With that confidence, I was ready to scale things up and apply it directly to the raw log file.

A Working Script

The real log files followed a predictable structure: every CMD: line began with the same base command, and the rest of the line was just the arguments passed to it. That meant I could loop through each line, look for CMD:, extract the argument string, and feed it through the same parser.

Here’s a minimal working example that uses an in-memory buffer to simulate the log file:

from io import StringIO

buffer = StringIO("""
CMD: python analyze.py --input data1.csv --threshold 0.5 --verbose
OUT: Starting analysis...
OUT: Found 1347 items
OUT: Done in 3.2s
CMD: python analyze.py --input data1.csv --threshold 0.6 --verbose
OUT: Starting analysis...
OUT: Found 894 items
OUT: Done in 2.1s
CMD: python analyze.py --input data1.csv --threshold 0.9 --verbose
OUT: Starting analysis...
OUT: Found 734 items
OUT: Done in 1.8s
""".lstrip())

for ln in buffer:
    # partition never raises, even on a blank line or an empty 'OUT:' line
    ln_type, _, remainder = ln.strip().partition(': ')

    # if multiple scripts are mixed in, one could deal with that here:
    #   organize multiple parsers and assign them to specific CMDs via a predicate
    #   since we only have 'analyze.py' we can hard-code that value for now
    if ln_type == 'CMD' and remainder.startswith('python analyze.py'):
        # str.removeprefix requires Python 3.9+
        argument_string = remainder.removeprefix('python analyze.py')
        namespace = parser.parse_args(sh_split(argument_string))
        print(namespace)
Namespace(input=PosixPath('data1.csv'), threshold=0.5, verbose=True)
Namespace(input=PosixPath('data1.csv'), threshold=0.6, verbose=True)
Namespace(input=PosixPath('data1.csv'), threshold=0.9, verbose=True)

The output gives you clean, typed namespaces for each run. From here, it’s easy to collect them into a list or DataFrame and start exploring trends across runs.
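Here is one way that collection step might look, as a minimal sketch. It turns each namespace into a dict with vars() and attaches the runtime from the "Done in ...s" line that follows. The regular expression assumes the OUT: format shown in the sample log, and the names records, runtime_pattern, and runtime_s are my own choices for illustration.

```python
import re
from argparse import ArgumentParser
from io import StringIO
from pathlib import Path
from shlex import split as sh_split

parser = ArgumentParser()
parser.add_argument('--input', type=Path)
parser.add_argument('--threshold', type=float)
parser.add_argument('--verbose', action='store_true')

log = StringIO(
    'CMD: python analyze.py --input data1.csv --threshold 0.5 --verbose\n'
    'OUT: Starting analysis...\n'
    'OUT: Found 1347 items\n'
    'OUT: Done in 3.2s\n'
    'CMD: python analyze.py --input data1.csv --threshold 0.6 --verbose\n'
    'OUT: Starting analysis...\n'
    'OUT: Found 894 items\n'
    'OUT: Done in 2.1s\n'
)

# Assumed OUT format, based only on the sample lines in this post
runtime_pattern = re.compile(r'Done in ([\d.]+)s')

records = []
for ln in log:
    ln_type, _, remainder = ln.strip().partition(': ')
    if ln_type == 'CMD' and remainder.startswith('python analyze.py'):
        argument_string = remainder.removeprefix('python analyze.py')
        namespace = parser.parse_args(sh_split(argument_string))
        # vars() exposes the Namespace attributes as a plain dict
        records.append({**vars(namespace), 'runtime_s': None})
    elif ln_type == 'OUT' and records:
        match = runtime_pattern.search(remainder)
        if match:
            # Attach the runtime to the most recent command
            records[-1]['runtime_s'] = float(match.group(1))

for record in records:
    print(record)
```

A list of dicts like this can be passed straight to pandas.DataFrame if pandas is available, which makes it simple to compare threshold against runtime across runs.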

You could keep extending this idea to handle multiple commands, infer missing arguments, or even reconstruct the commands to rerun failed cases. It’s a small trick, but it turns messy logs into structured data with almost no parsing logic of your own.
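As a rough sketch of the multiple-commands idea, the parsers could live in a dict keyed by command prefix, with a small function that picks the first match. Note that report.py and its arguments are hypothetical, invented only to give the sketch a second command; parse_command and the other names are likewise my own.

```python
from argparse import ArgumentParser
from pathlib import Path
from shlex import split as sh_split

analyze_parser = ArgumentParser()
analyze_parser.add_argument('--input', type=Path)
analyze_parser.add_argument('--threshold', type=float)
analyze_parser.add_argument('--verbose', action='store_true')

# Hypothetical second script, invented purely for illustration
report_parser = ArgumentParser()
report_parser.add_argument('--format', choices=['html', 'pdf'])
report_parser.add_argument('source', type=Path)

# Map each command prefix to the parser that understands its arguments
parsers = {
    'python analyze.py': analyze_parser,
    'python report.py': report_parser,
}


def parse_command(command):
    """Return (prefix, Namespace) for the first matching prefix, else None."""
    for prefix, cmd_parser in parsers.items():
        if command.startswith(prefix):
            arguments = sh_split(command.removeprefix(prefix))
            return prefix, cmd_parser.parse_args(arguments)
    return None


commands = [
    'python analyze.py --input data1.csv --threshold 0.9',
    'python report.py --format html summary.csv',
    'ls -la',
]
results = [parse_command(command) for command in commands]
print(results)
```

A plain startswith check is the simplest possible predicate. If the logs contain commands that share a prefix, a stricter match (for example, comparing the first tokens after shlex.split) would probably be safer.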

Wrap-Up

Parsing logs often feels like one of those chores that should be simple but somehow turns into a swamp of edge cases. What started as a quick regex experiment soon became a reminder that the right tool already exists; it just wasn’t being used in its usual context.

argparse isn’t only for parsing user input at runtime. It’s also a clean way to interpret existing command lines from scripts, notebooks, or logs. By leaning on it, I didn’t have to reinvent the wheel, and I got type validation, flag handling, and error messages for free.

In the end, this little trick turned a messy pile of text logs into structured, typed data I could analyze just like any other dataset. If you ever find yourself scraping or reviewing old experiment runs, try letting argparse do the heavy lifting. It’s one of those quiet standard library tools that keeps proving its worth.

What are your thoughts? Let me know on the DUTC Discord server!
