Parsing Unconventional Text#

Hey everyone! I’m back to playing around with Polars again and wanted to share a fun problem I came across on Stack Overflow. In this problem, the OP had some raw textual data in a key-value paired format. However, this format is not one that is commonly supported, like JSON. This means we get to write a custom parser!


from io import StringIO

buffer = StringIO('''
Technology=0.207;Financials=0.090;Health Care=0.084;Consumer Discretionary=0.069
Financials=0.250;Health Care=0.200;Consumer Staples=0.150;Industrials=0.400
''')

We need to read in this data and create a column for each of these fields, filling in null values for any row that is missing a field that appears in an earlier or later row.

For text-file parsing problems where you have an unconventional format, I recommend writing a parser in pure Python as this will provide you with the flexibility you need to solve the problem and add additional changes if the format changes in various ways.

Pure Python#

The Python solution will, of course, iterate through each line in the file, parse out each key-value pair, and store it in a dictionary. When we do this, we have two special cases to worry about:

In each row, we have to ask:

  1. Is this a field that we have NEVER encountered before?

  2. Were there fields that we have encountered before that we did not encounter in the current row?

You’ll see the approach taken for each below:

import pandas as pd
from collections import defaultdict

data = defaultdict(list)
rows = (line.strip() for line in buffer if line.strip()) # skip blank lines
for i, row in enumerate(rows, start=1):
    for entry in row.split(';'):
        field, value = entry.split('=', maxsplit=1)
        if field not in data: # ①
            data[field].extend([None] * (i-1))
        data[field].append(float(value))
    for values in data.values(): # ②
        values.extend([None] * (i - len(values)))

pd.DataFrame(data)

   Technology  Financials  Health Care  Consumer Discretionary  Consumer Staples  Industrials
0       0.207        0.09        0.084                   0.069               NaN          NaN
1         NaN        0.25        0.200                     NaN              0.15          0.4
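As an aside—and not the approach taken above—the null-padding bookkeeping can be delegated to pandas itself: build one dict per line and let the `DataFrame` constructor align the keys, filling any missing ones with NaN. A minimal sketch:

```python
from io import StringIO

import pandas as pd

buffer = StringIO('''
Technology=0.207;Financials=0.090;Health Care=0.084;Consumer Discretionary=0.069
Financials=0.250;Health Care=0.200;Consumer Staples=0.150;Industrials=0.400
''')

# one dict per non-blank line; pandas aligns the keys across rows
records = [
    {field: float(value)
     for field, value in (entry.split('=', maxsplit=1)
                          for entry in line.split(';'))}
    for line in map(str.strip, buffer) if line
]
df = pd.DataFrame(records)
```

The trade-off is that you lose explicit control over the padding logic, but the two special cases above disappear entirely.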

The Python approach isn’t too tricky, but I wouldn’t expect it to be the first thing many people reach for. In my consulting work, I often encounter poorly parsed data already sitting in some sort of DataFrame-like structure. Then I am asked, “How can I parse this into separate columns?” Of course, I advise that this is best done in pure Python, but there is a tendency to reach for “whatever solution we can arrive at the fastest and that will make the most sense to someone else in the future.”

That is where I oblige with some (occasionally odd) DataFrame contortions.

Polars (2 passes)#

We can solve this problem using Polars in two passes on the data:

  1. Find all of the unique field names

  2. Use those fields to match patterns and extract the corresponding values

import polars as pl
from polars import col

buffer.seek(0) # rewind: the pure Python pass consumed the buffer
df = pl.DataFrame({
    'raw': buffer.read().strip().splitlines()
})

fields = df.select( # ①
    col('raw').str.extract_all(r'[A-Za-z ]+').explode().unique()
)

df.select( # ②
    col('raw').str.extract(rf'{field}=([\d.]+)').cast(pl.Float64).alias(field)
    for field in fields['raw']
)

shape: (2, 6)
┌────────────┬─────────────┬──────────────────┬────────────────────────┬─────────────┬────────────┐
│ Financials ┆ Health Care ┆ Consumer Staples ┆ Consumer Discretionary ┆ Industrials ┆ Technology │
│ ---        ┆ ---         ┆ ---              ┆ ---                    ┆ ---         ┆ ---        │
│ f64        ┆ f64         ┆ f64              ┆ f64                    ┆ f64         ┆ f64        │
╞════════════╪═════════════╪══════════════════╪════════════════════════╪═════════════╪════════════╡
│ 0.09       ┆ 0.084       ┆ null             ┆ 0.069                  ┆ null        ┆ 0.207      │
│ 0.25       ┆ 0.2         ┆ 0.15             ┆ null                   ┆ 0.4         ┆ null       │
└────────────┴─────────────┴──────────────────┴────────────────────────┴─────────────┴────────────┘

Polars (via JSON)#

For the final solution here, I am contorting the data in Polars in order to make it resemble JSON. I can then rely on the JSON parsing mechanics in Polars to do the rest of the work for me. To put all of this text into a JSON-compatible format, I need to…

  1. replace ';' and '=' with ',' and ':' (respectively)

  2. add quotes around the field names: Technology → "Technology"

  3. surround the entire string in a set of curly braces

  4. parse the JSON string
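Before handing this over to Polars, it can help to see the target of steps 1–3 on a single line. Here is the same transformation sketched with Python’s `re` module (a standalone illustration, not part of the Polars pipeline):

```python
import re

line = 'Technology=0.207;Financials=0.090'

s = line.replace(';', ',').replace('=', ':')   # ① swap delimiters
s = re.sub(r'([a-zA-Z ]+)', r'"\1"', s)        # ② quote the field names
s = '{' + s + '}'                              # ③ wrap in curly braces
print(s)  # {"Technology":0.207,"Financials":0.090}
```

The result is valid JSON, so step 4 is just a decode.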

import polars as pl
from polars import col

buffer.seek(0) # rewind the buffer once more
df = pl.DataFrame({
    'raw': buffer.read().strip().splitlines()
})

df.select(
    pl.format(
        '{{}}', # ③
        col('raw').str.replace_many([';', '='], [',', ':']) # ①
        .str.replace_all(r'([a-zA-Z ]+)', r'"$1"') # ②
    )
    .str.json_decode() # ④
    .alias('parsed')
).unnest('parsed')

shape: (2, 6)
┌────────────┬────────────┬─────────────┬────────────────────────┬──────────────────┬─────────────┐
│ Technology ┆ Financials ┆ Health Care ┆ Consumer Discretionary ┆ Consumer Staples ┆ Industrials │
│ ---        ┆ ---        ┆ ---         ┆ ---                    ┆ ---              ┆ ---         │
│ f64        ┆ f64        ┆ f64         ┆ f64                    ┆ f64              ┆ f64         │
╞════════════╪════════════╪═════════════╪════════════════════════╪══════════════════╪═════════════╡
│ 0.207      ┆ 0.09       ┆ 0.084       ┆ 0.069                  ┆ null             ┆ null        │
│ null       ┆ 0.25       ┆ 0.2         ┆ null                   ┆ 0.15             ┆ 0.4         │
└────────────┴────────────┴─────────────┴────────────────────────┴──────────────────┴─────────────┘


There we have it: many ways to tackle this problem in both pure Python and Polars.

What do you think about my approach? Anything you’d do differently? Something not making sense? Let me know on the DUTC Discord server.

Hope you enjoyed this week’s blog post—talk to you next time!