Posts tagged python
Polars Has Inequality Joins!
- 02 October 2024
It finally happened: Polars supports inequality joins, at least if you are using version 1.7.0 or later.
This week, I want to revisit a familiar problem from a recent blog post I wrote covering a few different types of joins and when they are useful. I will tackle the same problem in Polars and discuss how using an inequality join can simplify our lives and make our compute workloads more manageable.
In many data analysis scenarios, simple equality joins (matching rows where column values are exactly the same) aren’t enough. A common use case involves working with time-based data, such as tracking state changes of devices and logging alert events for those devices. Here, you don’t just want to match records by a shared device_id; you need to match based on time intervals, like determining which alerts occurred during a specific state. These kinds of joins, known as inequality joins, are essential in areas like time-series analysis, event-driven systems, and continuous monitoring, where relationships between data points depend on conditions beyond simple equality (e.g., a timestamp falling within a specific range).
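To make the idea concrete, here is a minimal pure-Python sketch of what an inequality join computes for the device scenario described above. The table layouts, the field names (`device_id`, `start`, `end`, `time`), and the sample values are all hypothetical, and the nested loop only illustrates the semantics; it is not how Polars implements the join.

```python
from datetime import datetime

# Hypothetical state-change table: one row per interval a device spent in a state.
states = [
    {"device_id": 1, "state": "idle", "start": datetime(2024, 1, 1, 0), "end": datetime(2024, 1, 1, 6)},
    {"device_id": 1, "state": "active", "start": datetime(2024, 1, 1, 6), "end": datetime(2024, 1, 1, 12)},
    {"device_id": 2, "state": "active", "start": datetime(2024, 1, 1, 0), "end": datetime(2024, 1, 1, 12)},
]

# Hypothetical alert log: one row per alert event.
alerts = [
    {"device_id": 1, "time": datetime(2024, 1, 1, 7), "alert": "overheat"},
    {"device_id": 2, "time": datetime(2024, 1, 1, 3), "alert": "low battery"},
    {"device_id": 1, "time": datetime(2024, 1, 1, 13), "alert": "offline"},
]


def inequality_join(states, alerts):
    """Pair each alert with the state interval that contains its timestamp."""
    matches = []
    for alert in alerts:
        for state in states:
            # Equality condition: both rows describe the same device.
            same_device = alert["device_id"] == state["device_id"]
            # Inequality conditions: start <= time < end.
            in_interval = state["start"] <= alert["time"] < state["end"]
            if same_device and in_interval:
                matches.append({**alert, "state": state["state"]})
    return matches


joined = inequality_join(states, alerts)
for row in joined:
    print(row["device_id"], row["alert"], row["state"])
```

The third alert falls outside every recorded interval, so it produces no match. In Polars 1.7.0 or later, the same predicates would be handed to `join_where` rather than written as an explicit loop.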
DataFrames: one-hot encoding and back
- 25 September 2024
Welcome to this week’s edition of Cameron’s Corner! This week, I want to take a quick dive into one-hot encoding and how we can use it within our popular tabular data tools in Python.
For some background, one-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format where each category is represented as a unique vector of 0s and 1s. It is commonly used when dealing with categorical variables in algorithms that require numerical input, such as neural networks and decision trees. Each category is transformed into a binary vector with one element set to 1 and all others set to 0, allowing models to interpret the data without assigning implicit ordinal relationships. One-hot encoding is especially useful in cases like text classification, image recognition, and natural language processing tasks.
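As a rough illustration of the encoding described above, and of going back again, here is a small pure-Python sketch. The `colors` data and the helper names `one_hot` and `from_one_hot` are made up for this example; tabular libraries ship their own tools for this.

```python
# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

# Fix a stable category order so every vector has the same layout.
categories = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot(values, categories):
    """Encode each value as a vector with a single 1 at its category's position."""
    return [[int(value == category) for category in categories] for value in values]


def from_one_hot(encoded, categories):
    """Invert the encoding by locating the position of the 1 in each vector."""
    return [categories[row.index(1)] for row in encoded]


encoded = one_hot(colors, categories)
decoded = from_one_hot(encoded, categories)

print(encoded)
print(decoded)
```

Because each vector contains exactly one 1, the round trip recovers the original values without imposing any order on the categories.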
Plotting without Weekends
- 18 September 2024
Welcome back to Cameron’s Corner! This week, I wanted to touch on a timeseries-oriented data visualization question that I came across: “How do I plot hourly data, but omit the weekends?”
On the surface, this sounds like a simple data-filtering question. However, we also need to consider the visual elements that go into visualizing these data. Take a peek down at the Original Visualization section to see an example of the chart that generated this question.
“Broadcasting” in Polars
- 11 September 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to look into performance in Polars when working with multiple DataFrames. We are going to cover “broadcasting”—a term used for aligning NumPy arrays against one another—in Polars. Polars always encourages us to be explicit when aligning and working with multiple DataFrames, but there are a few different conceptual approaches we can take to arrive at the same solution. This is where today’s micro-benchmark will come in. We want to find the fastest approach for this specific problem.
Say we have two LazyFrames, where one is of an undetermined length and the other has a known length of 1 (after collection). We want to be able to perform arithmetic across these two LazyFrames as in the following example:
pandas: Months, Days, and Categoricals
- 28 August 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss how we can leverage pandas.Categorical arrays when working with calendar months and weekdays. This is a bit of a longstanding issue I’ve had with pandas. However, I am not the only one who has thought of this, so I have to respect the priorities of the core developers who contribute their time to this project.
We have some dates stored in a pandas.Series, and, at some point during our analytical pipeline, we need to work with individual months and/or weekdays. We can readily extract the integer value that corresponds with each month (where January ⇒ 1, December ⇒ 12), OR we can extract the string name of the month. The same transformations are available for the day of the week (where Monday ⇒ 0, Sunday ⇒ 6).
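The numbering conventions above can be checked with a quick sketch. It uses only the standard library rather than pandas, and the sample dates are arbitrary; on a pandas.Series of datetimes, the `.dt` accessor exposes the analogous `month`, `month_name()`, `dayofweek`, and `day_name()`.

```python
import calendar
from datetime import date

# Arbitrary sample dates chosen for illustration.
dates = [date(2024, 1, 1), date(2024, 8, 28), date(2024, 12, 29)]

# Integer month: January => 1, December => 12.
month_numbers = [d.month for d in dates]

# String month name (English under the default locale).
month_names = [calendar.month_name[d.month] for d in dates]

# Integer weekday: Monday => 0, Sunday => 6.
weekday_numbers = [d.weekday() for d in dates]

# String weekday name (English under the default locale).
weekday_names = [calendar.day_name[d.weekday()] for d in dates]

print(month_numbers, month_names)
print(weekday_numbers, weekday_names)
```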
Case When: A Welcome Addition to pandas Conveniences
- 21 August 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tell you about a relatively new pandas.Series method: case_when. This method exists to conveniently make replacements across multiple conditions, but instead of describing what it does, I’d rather show you. Let’s jump into our premise.
Suppose you are a teacher grading papers for six students. They all have different grades, which are the following:
pandas: Within the Restricted Computation Domain
- 14 August 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.
Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.
pandas groupby Along the Columns
- 31 July 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss a deprecation in the pandas API. Unfortunately, the axis=… argument in pandas.DataFrame.groupby is being deprecated. While it is official, there has been some disagreement within the community on this newest change, primarily because of the conveniences it offers.
But, what is the axis parameter, and what workarounds do we have? Let’s take a look:
pandas.concat, explained.
- 10 July 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tackle a pandas question I received concerning the different ways to combine pandas.DataFrames. Today, I’ll focus on pandas.concat, since we have covered DataFrame merges quite thoroughly in previous weeks. Specifically, we’ll take a look at DataFrame Inequality Joins, DataFrame Joins & Sets, and DataFrame Joins & MultiSets.
In pandas, we have three explicit ways to combine DataFrames:
DataFrame Inequality Joins
- 10 July 2024
Hello, and welcome back to Cameron’s Corner! This week, I want to follow up on two blog posts from a couple months back that discussed DataFrame Joins & Sets and DataFrame Joins & MultiSets.
Instead of speaking more about equality joins, I want to talk about inequality joins. These are a special table join operation that handles conditions when keys don’t match up perfectly, particularly when working with continuous (non-categorical) data.
Flexibility & Ergonomics
- 26 June 2024
Hi all, welcome back to Cameron’s Corner! This week, I want to talk about flexibility and ergonomics.
Oftentimes, we want to write code that is flexible to adapt to the ever-changing problems we are presented with. This often means that we have to write code that anticipates different formulations of an existing business problem. On the other hand, we should also endeavor to write code that is readily usable by our colleagues or other end-users. While these forces—flexibility and ergonomics—may feel like they pull in opposite directions, we should always strive to find a solution where these ideas work in tandem. The most generalized approach we can take to satisfy this is to design APIs with two primary layers of abstraction:
A FlagEnum Categorical in pandas
- 19 June 2024
Hi all, welcome back to Cameron’s Corner! This week, I want to explore the encoding of combinatoric sets (from a limited pool) inside a pandas.DataFrame. In more colloquial terms, I want to explore the following example:
We have a catalog of programming articles & videos (entities).
Tabular Group By Sets
- 12 June 2024
Hi all, welcome back to Cameron’s Corner! This week, I want to replicate some convenient analytical functionality from DuckDB in both pandas and Polars.
Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
pandas & Polars: Window Functions vs Group By
- 05 June 2024
Welcome to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
This week, I want to dive back into “window” and “group by” operations. This time, instead of focusing on the SQL syntax, we’ll cover my two favorite DataFrame libraries, pandas and Polars, to discuss the differences in their APIs.
Faster strftime
- 29 May 2024
Welcome back to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
On to the topic at hand. I wanted to tackle a fun pandas optimization problem, focusing on converting datetime objects to their date counterparts. For this problem, I did take it “head on,” meaning I did not inquire why the end user wanted this output, just performed some benchmarking on their existing approaches and threw in a couple of my own.
Decorators: Registration Pattern
- 22 May 2024
Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
Okay, on to this week’s post!
Working With Files Deep in Your Code
- 15 May 2024
Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics for experts. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
As you may already know, we frequently train corporate teams on topics such as introduction to Python, advanced Python, API design, data analysis, and much more! Our trainings always involve custom curriculum which we tailor to the needs of the team and balance with the expectations of management.
Tables: Window Functions vs Group By
- 08 May 2024
Hello, everyone! This week, I want to dive into “window” and “group by.” What’s the difference? When should you use one over the other? Let’s take a look.
Both window and group by functions are used to perform operations across a subset of rows of a table. These rows are subsetted based on a unique grouping of values within a column.
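One way to see the difference is a tiny pure-Python sketch with made-up rows: a group by returns one row per group, while a window function keeps every input row and attaches the group-level result to it. The field names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical table with a grouping column and a value column.
rows = [
    {"group": "a", "value": 1},
    {"group": "a", "value": 3},
    {"group": "b", "value": 10},
]

# Both operations start from the same per-group aggregate.
totals = defaultdict(int)
for row in rows:
    totals[row["group"]] += row["value"]

# Group by: the result has one entry per unique group.
grouped = dict(totals)

# Window: the result has one entry per input row, with the aggregate attached.
windowed = [{**row, "group_total": totals[row["group"]]} for row in rows]

print(grouped)
print(windowed)
```

The group-by result shrinks to two entries, whereas the windowed result still has three rows, each carrying its group's total.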
When the .index is convenient
- 01 May 2024
The blazingly-fast DataFrame library, Polars, has a huge conceptual difference from the DataFrame veteran, pandas: pandas is ALL about working with a consistent index, whereas Polars forces individuals to work more explicitly using joins.
I came across a question on Stack Overflow that provided a great example of the benefits of working in an index-aligned way.
DataFrame Joins & MultiSets
- 24 April 2024
There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this post, I want to clarify this relationship (and show you some Python and pandas code along the way).
Last week, I covered unique equality joins, which describe the simplest scenario in which sets and table join logic completely overlap. This parallels the idea that table joins can be represented with Venn diagrams. This week, I want to show where this mode of thinking tends to fall flat.
DataFrame Joins & Sets
- 17 April 2024
There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).
Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).
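As a small sketch of that correspondence, assume two hypothetical tables whose join keys are unique, modeled here as plain dictionaries. With unique keys, the keys of an inner join are the set intersection, and the keys of a full outer join are the set union.

```python
# Hypothetical tables keyed by a unique join key.
left = {"a": 1, "b": 2, "c": 3}
right = {"b": 20, "c": 30, "d": 40}

# Inner join: keys present in both tables (set intersection).
inner_keys = left.keys() & right.keys()
inner = {key: (left[key], right[key]) for key in sorted(inner_keys)}

# Full outer join: keys present in either table (set union); missing sides become None.
outer_keys = left.keys() | right.keys()
outer = {key: (left.get(key), right.get(key)) for key in sorted(outer_keys)}

print(inner)
print(outer)
```

This mapping only holds while the keys are unique; once a key repeats, plain sets can no longer describe the result.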
Parsing Unconventional Text
- 10 April 2024
Hey everyone! I’m back to playing around with Polars again and wanted to share a fun problem I came across on Stack Overflow. In this problem, the OP had some raw textual data in a key-value paired format. However, this format is not one that is commonly supported, like JSON. This means we get to write a custom parser!
We need to read in this data and create a column for each of these fields, appropriately filling in null values for any row that is missing a field that is previously or later defined.
Intentional Visualizations
- 27 March 2024
Hello, everyone! This week, I want to discuss the often-overlooked exploratory charts.
I often speak to a dichotomy of purposes whenever I discuss data visualization. These purposes are designed to help organize our thoughts about both why and how we should visualize our data in the first place. The reasons one might reach for a visualization are:
Timing DataFrame Filters
- 20 March 2024
Hello, everyone! I wanted to follow up on last week’s blog post, Profiling pandas Filters, and test how Polars stands up in its simple filtering operations.
An important note: these timings are NOT intended to be exhaustive and should not be used to determine if one tool is “better” than another.
Profiling pandas Filters
- 13 March 2024
Hello, everyone! For Cameron’s Corner this week, I wanted to spend some time differentiating between various filtering operations in pandas. Specifically, I wanted to test out operations on a DatetimeIndex for working with slices of datetime values.
Let’s do some quick timings for each of these approaches. I’ve ordered them by what my intuition tells me will be slowest to fastest:
Python Set vs Pandas.Index
- 06 March 2024
For the past few weeks, I have been meeting with some fantastic clients in one-on-one sessions to cover the core Python and pandas skills needed to perform rapid data analysis. We have discussed a variety of topics, but this week has been one of my favorites because we are doing a deep dive into pandas. Of course, the framing for pandas is all about the Index, so I decided to keep it light and ensure we tie it back to some core Python concepts.
When discussing the Index in pandas, I always find it useful to contrast it against a Python built-in that exhibits some similar behaviors: the set. This week, I want to focus on each of these data structures to understand where they overlap, their differences, and the lessons they can teach us.
United States President’s Age
- 14 February 2024
Welcome to Cameron’s Corner! This week, I want to recreate a chart from a post on r/dataisbeautiful by u/graphguy.
Polars Expressions on Nested Data
- 07 February 2024
Welcome back to Cameron’s Corner! This week, I wanted to share another interesting question I came across on Stack Overflow: “How to add numeric value from one column to other List colum elements in Polars?”.
Speaking of Polars, make sure you sign up for our upcoming µtraining, “Blazing Fast Analyses with Polars.” This µtraining consists of live discussion, hands-on problem solving, and code review with our instructors. You won’t want to miss it!
Tiered Bar Chart in Matplotlib
- 31 January 2024
Welcome back to Cameron’s Corner! This week, I wanted to share an answer I posted on Stack Overflow to a question entitled Create a bar chart in Python grouping the x-axis by two variables. This question sought to create a grouped bar chart, but also have hierarchical x-tick labels.
The question effectively asked how to create a chart like this:
Good pandas means good Python
- 24 January 2024
Welcome back to Cameron’s Corner! This week, I want to talk about the intersection of Python and pandas. I often hear from other teachers that it is easiest to teach skills that will help students get “up and running.” Unfortunately, this often translates to “let’s teach the pandas API.” This leads to many roadblocks down the line caused by an extremely superficial understanding of how to think about pandas operations or how to best leverage Python to lean into your pandas tasks.
So, let’s take a look at a data-cleaning example, where, while possible, working through pandas will be clumsy.
Polars: Groupby and idxmin
- 17 January 2024
Welcome back to Cameron’s Corner! It’s the third week of January, and, instead of talking about graphs, I want to take a dive into Polars. I recently addressed a question on Polars’ Discord server, diving into the different ways to perform an “index minimum” operation across groups.
Sure, there’s a built-in Expression.idx_min(), but it operates a little differently than it does in pandas. Let’s take a look:
Counting paths in pandas & networkx
- 10 January 2024
Welcome back to Cameron’s Corner! It’s the second week of January, and I’m already here to talk about graphs. No, not the kind we make in Matplotlib, but network graphs! This blog post was inspired by a project I’ve been working on: counting the number of indirect connections between two non-adjacent nodes in a bipartite graph.
In graph theory terms, a graph is bipartite if its nodes can be split into two disjoint sets, where nodes from one set connect to nodes from the other set but never to nodes within the same set. Here is an example from Wikipedia of what a complete bipartite graph might look like:
Don’t Use This Code’s top 10 resolutions of 2024 for YOU!
- 03 January 2024
Hello everyone and welcome to the first Cameron’s Corner of the New Year! Before we get too far, I wanted to just do a quick recap of our year.
In 2023, Don’t Use This Code…
Visualizing Temperature Deviations
- 20 December 2023
This week, I wanted to do some data manipulation in Polars and recreate a data visualization I came across a while ago from the Python Graph Gallery, titled “Area Chart Over Flexible Baseline.” I liked this type of chart because it highlights an aggregate measure of interest that is easy to understand and demonstrates how much that measure deviates from some context. In this case, the chart communicates how much the temperature across a given year in a specific city has deviated with respect to historical aggregations.
Most free historical weather data APIs that I have encountered consume latitude and longitude coordinates instead of addresses. However, to make the code I am using here, I am going to use an address API to query the location of a given city/state. We can use the response from this API to feed into the weather API. This makes it very trivial to query different locations across the world!
DataFrame Value Membership Testing
- 13 December 2023
This week, I received a great question on our Discord Server about finding strings within a list in a pandas.Series.
But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.
Playing Scrabble with Xarray
- 06 December 2023
Welcome to Cameron’s Corner! In my last blog post, I explored how to use index-alignment to solve some simple Scrabble problems. Today I want to do the same using Xarray!
But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.
Playing Scrabble Faster
- 22 November 2023
Welcome to Cameron’s Corner! This morning, I gave a seminar on coding word games like an expert! I talked about prototyping the game of Scrabble, and wanted to share some additional thoughts I had after the presentation.
But, before I get started, I want to invite you to our next (and final!) seminar in our Python: How the Experts Do It series, “Battleship: An Expert’s Approach to Seemingly Simple Games.” Join us as we embark on the Battleship journey, leveraging Python’s object-oriented prowess to design and implement this iconic game.
Playing (more) Tic-Tac-Toe
- 15 November 2023
Hello everyone and welcome back! Last week, we discussed my live-coded approach (and improvements!) to the game of Tic-Tac-Toe. This week, I wanted to see how flexible my approach is going to be.
But, before we get into it, make sure you register for our next expert lab, “Word Games: An Expert’s Approach to Seemingly Simple Games.” During this session, we’ll unravel the mysteries of word unscrambling in Jumble and challenge ourselves with the strategic wordplay of Scrabble. You’ll witness firsthand how Python’s powerful string manipulation features and other data structures can simplify coding of these games.
Playing Tic-Tac-Toe
- 08 November 2023
Hello, everyone! This week, I held a seminar where I live-coded the game of tic-tac-toe based on some constraints from a client. I wanted to share with you what the final version of this code would look like after a round of review.
Before we get started, I want to tell you about my upcoming seminar with a similar theme, “A Python Expert’s Approach to Rock, Paper, Scissors.” During this seminar, we’ll dissect the game’s rules, design custom Python functions, and explore the strategic thinking behind this simple yet captivating game. We’ll start with the basics, modeling the game using core Python data structures, and then quickly progress to incorporate more advanced features.
When do I Write a Function?
- 21 June 2023
Hey all, this week I wanted to visit a topic that comes up across many of the courses that we teach:
When do I write a function?