Parsing Unconventional Text#

Hey everyone! I’m back to playing around with Polars again and wanted to share a fun problem I came across on Stack Overflow. In this problem, the OP had some raw textual data in a key-value paired format. However, this format is not one that is commonly supported, like JSON. This means we get to write a custom parser!


from io import StringIO

buffer = StringIO('''
Technology=0.207;Financials=0.090;Health Care=0.084;Consumer Discretionary=0.069
Financials=0.250;Health Care=0.200;Consumer Staples=0.150;Industrials=0.400
''')

We need to read in this data and create a column for each of these fields, filling in null values for any row that is missing a field that appears in an earlier or later row.

For text-file parsing problems where you have an unconventional format, I recommend writing a parser in pure Python as this will provide you with the flexibility you need to solve the problem and add additional changes if the format changes in various ways.

Pure Python#

The Python solution will, of course, iterate through each line in the file, parse out each key-value pair, and store it in a dictionary. When we do this, we have two special cases to worry about:

In each row, we have to ask:

  1. Is this a field that we have NEVER encountered before?

  2. Were there fields that we have encountered before that we did not encounter in the current row?

You’ll see the approach taken for each below:

import pandas as pd
from collections import defaultdict

data = defaultdict(list)
rows = (line.strip() for line in buffer if line.strip()) # skip blank lines
for i, row in enumerate(rows, start=1):
    for entry in row.split(';'):
        field, value = entry.split('=', maxsplit=1)
        if field not in data: # ①
            data[field].extend([None] * (i-1))
        data[field].append(float(value))
    for values in data.values(): # ②
        values.extend([None] * (i - len(values)))

pd.DataFrame(data)

   Technology  Financials  Health Care  Consumer Discretionary  Consumer Staples  Industrials
0       0.207        0.09        0.084                   0.069               NaN          NaN
1         NaN        0.25        0.200                     NaN              0.15          0.4
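As an aside—and not the approach taken above—the null-padding bookkeeping can be delegated to pandas itself: build one dict per line and let the `DataFrame` constructor align the keys, filling any missing ones with NaN. A minimal sketch:

```python
from io import StringIO

import pandas as pd

buffer = StringIO('''
Technology=0.207;Financials=0.090;Health Care=0.084;Consumer Discretionary=0.069
Financials=0.250;Health Care=0.200;Consumer Staples=0.150;Industrials=0.400
''')

# one dict per non-blank line; pandas aligns the keys across rows
records = [
    {field: float(value)
     for field, value in (entry.split('=', maxsplit=1)
                          for entry in line.split(';'))}
    for line in map(str.strip, buffer) if line
]
df = pd.DataFrame(records)
```

The trade-off is that you lose explicit control over the padding logic, but the two special cases above disappear entirely.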

The Python approach isn’t too tricky, but I wouldn’t expect it to be the first thing many people reach for. In my consulting work, I often encounter poorly parsed data already sitting in some sort of DataFrame-like structure. Then I am asked, “How can I parse this into separate columns?” Of course, I advise that this is best done in pure Python, but there is a tendency to reach for “whatever solution we can arrive at the fastest and that will make the most sense to someone else in the future.”

That is where I oblige with some (occasionally odd) DataFrame contortions.

Polars (2 passes)#

We can solve this problem using Polars in two passes on the data:

  1. Find all of the unique field names

  2. Use those fields to match patterns and extract the corresponding values

import polars as pl
from polars import col

buffer.seek(0) # rewind: the pure Python pass consumed the buffer
df = pl.DataFrame({
    'raw': buffer.read().strip().splitlines()
})

fields = df.select( # ①
    col('raw').str.extract_all(r'[A-Za-z ]+').explode().unique()
)

df.select( # ②
    col('raw').str.extract(rf'{field}=([\d.]+)').cast(pl.Float64).alias(field)
    for field in fields['raw']
)

shape: (2, 6)
┌────────────┬─────────────┬──────────────────┬────────────────────────┬─────────────┬────────────┐
│ Financials ┆ Health Care ┆ Consumer Staples ┆ Consumer Discretionary ┆ Industrials ┆ Technology │
│ ---        ┆ ---         ┆ ---              ┆ ---                    ┆ ---         ┆ ---        │
│ f64        ┆ f64         ┆ f64              ┆ f64                    ┆ f64         ┆ f64        │
╞════════════╪═════════════╪══════════════════╪════════════════════════╪═════════════╪════════════╡
│ 0.09       ┆ 0.084       ┆ null             ┆ 0.069                  ┆ null        ┆ 0.207      │
│ 0.25       ┆ 0.2         ┆ 0.15             ┆ null                   ┆ 0.4         ┆ null       │
└────────────┴─────────────┴──────────────────┴────────────────────────┴─────────────┴────────────┘

Polars (via JSON)#

For the final solution here, I am contorting the data in Polars in order to make it resemble JSON. I can then rely on the JSON parsing mechanics in Polars to do the rest of the work for me. To put all of this text into a JSON-compatible format, I need to…

  1. replace ';' and '=' with ',' and ':' (respectively)

  2. add quotes around the field names: Technology → "Technology"

  3. surround the entire string in a set of curly braces

  4. parse the JSON string
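Before handing this over to Polars, it can help to see the target of steps 1–3 on a single line. Here is the same transformation sketched with Python’s `re` module (a standalone illustration, not part of the Polars pipeline):

```python
import re

line = 'Technology=0.207;Financials=0.090'

s = line.replace(';', ',').replace('=', ':')   # ① swap delimiters
s = re.sub(r'([a-zA-Z ]+)', r'"\1"', s)        # ② quote the field names
s = '{' + s + '}'                              # ③ wrap in curly braces
print(s)  # {"Technology":0.207,"Financials":0.090}
```

The result is valid JSON, so step 4 is just a decode.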

import polars as pl
from polars import col

buffer.seek(0) # rewind the buffer once more
df = pl.DataFrame({
    'raw': buffer.read().strip().splitlines()
})

df.select(
    pl.format(
        '{{}}', # ③
        col('raw').str.replace_many([';', '='], [',', ':']) # ①
        .str.replace_all(r'([a-zA-Z ]+)', r'"$1"') # ②
    )
    .str.json_decode() # ④
    .alias('parsed')
).unnest('parsed')

shape: (2, 6)
┌────────────┬────────────┬─────────────┬────────────────────────┬──────────────────┬─────────────┐
│ Technology ┆ Financials ┆ Health Care ┆ Consumer Discretionary ┆ Consumer Staples ┆ Industrials │
│ ---        ┆ ---        ┆ ---         ┆ ---                    ┆ ---              ┆ ---         │
│ f64        ┆ f64        ┆ f64         ┆ f64                    ┆ f64              ┆ f64         │
╞════════════╪════════════╪═════════════╪════════════════════════╪══════════════════╪═════════════╡
│ 0.207      ┆ 0.09       ┆ 0.084       ┆ 0.069                  ┆ null             ┆ null        │
│ null       ┆ 0.25       ┆ 0.2         ┆ null                   ┆ 0.15             ┆ 0.4         │
└────────────┴────────────┴─────────────┴────────────────────────┴──────────────────┴─────────────┘


There we have it: many ways to tackle this problem in both pure Python and Polars.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Hope you enjoyed this week’s blog post—talk to you next time!