Posts tagged polars
Polars Has Inequality Joins!
- 02 October 2024
It finally happened: Polars supports inequality joins, at least if you are using version 1.7.0 or later.
This week, I want to revisit a familiar problem from a recent blog post I wrote covering a few different types of joins and when they are useful. I will tackle the same problem in Polars and discuss how an inequality join can simplify our lives and make our compute workloads more manageable.
In many data analysis scenarios, simple equality joins (matching rows where column values are exactly the same) aren’t enough. A common use case involves working with time-based data, such as tracking state changes of devices and logging alert events for those devices. Here, you don’t just want to match records by a shared device_id; you need to match based on time intervals, like determining which alerts occurred during a specific state. These kinds of joins, known as inequality joins, are essential in areas like time-series analysis, event-driven systems, and continuous monitoring, where relationships between data points depend on conditions beyond simple equality (e.g., a timestamp falling within a specific range).
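To make the matching rule concrete, here is a minimal plain-Python sketch of what an inequality join computes for the device/alert scenario described above. The sample rows, the field names, and the `inequality_join` helper are all hypothetical and exist only for illustration. In Polars 1.7.0 and later, `DataFrame.join_where` is the method intended for expressing this kind of predicate; the sketch below only mirrors the semantics and is not the Polars implementation.

```python
from datetime import datetime

# Hypothetical sample data: state intervals per device, plus alert events.
states = [
    {"device_id": 1, "state": "idle", "start": datetime(2024, 1, 1, 0), "end": datetime(2024, 1, 1, 6)},
    {"device_id": 1, "state": "active", "start": datetime(2024, 1, 1, 6), "end": datetime(2024, 1, 1, 12)},
    {"device_id": 2, "state": "active", "start": datetime(2024, 1, 1, 0), "end": datetime(2024, 1, 1, 12)},
]
alerts = [
    {"device_id": 1, "alert": "overheat", "time": datetime(2024, 1, 1, 7)},
    {"device_id": 2, "alert": "low battery", "time": datetime(2024, 1, 1, 3)},
    {"device_id": 1, "alert": "offline", "time": datetime(2024, 1, 1, 13)},
]


def inequality_join(alerts, states):
    """Pair each alert with the state interval that contains its timestamp."""
    return [
        {**state, **alert}
        for alert in alerts
        for state in states
        # Equality condition on the key, inequality conditions on the time range.
        if alert["device_id"] == state["device_id"]
        and state["start"] <= alert["time"] < state["end"]
    ]


matched = inequality_join(alerts, states)
for row in matched:
    print(row["device_id"], row["alert"], row["state"])
```

The nested loop compares every alert against every state, which is exactly the cost a dedicated inequality join in a DataFrame library tries to avoid. The "offline" alert falls outside every interval, so it is dropped, as it would be in an inner join.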
pandas: Within the Restricted Computation Domain
- 14 August 2024
Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.
Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.
DataFrame Inequality Joins
- 10 July 2024
Hello, and welcome back to Cameron’s Corner! This week, I want to follow up on two blog posts from a couple months back that discussed DataFrame Joins & Sets and DataFrame Joins & MultiSets.
Instead of speaking more about equality joins, I want to talk about inequality joins. These are a special table join operation that handles conditions when keys don’t match up perfectly, particularly when working with continuous (non-categorical) data.
pandas & Polars: Window Functions vs Group By
- 05 June 2024
Welcome to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
This week, I want to dive back into “window” and “group by” operations. This time, instead of focusing on the SQL syntax, we’ll cover my two favorite DataFrame libraries, pandas and Polars, to discuss the differences in their APIs.
Decorators: Registration Pattern
- 22 May 2024
Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
Okay, on to this week’s post!
Working With Files Deep in Your Code
- 15 May 2024
Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics for experts. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.
As you may already know, we frequently train corporate teams on topics such as introduction to Python, advanced Python, API design, data analysis, and much more! Our trainings always involve custom curriculum which we tailor to the needs of the team and balance with the expectations of management.
Tables: Window Functions vs Group By
- 08 May 2024
Hello, everyone! This week, I want to dive into “window” and “group by.” What’s the difference? When should you use one over the other? Let’s take a look.
Both window and group by functions are used to perform operations across a subset of rows of a table. These rows are subsetted based on a unique grouping of values within a column.
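As a rough illustration of the difference, the plain-Python sketch below computes the same per-group total two ways. The `rows` data and the column names are made up for this example. The group-by result has one row per group, while the window result keeps every input row and attaches the group's total to it.

```python
from collections import defaultdict

# Hypothetical table: each dict is one row.
rows = [
    {"group": "a", "value": 1},
    {"group": "a", "value": 3},
    {"group": "b", "value": 10},
]

# Accumulate one total per unique value in the "group" column.
totals = defaultdict(int)
for row in rows:
    totals[row["group"]] += row["value"]

# Group by: the output collapses to one row per group.
grouped = [{"group": group, "total": total} for group, total in totals.items()]

# Window: the same totals, broadcast back so the output has as many rows as the input.
windowed = [{**row, "total": totals[row["group"]]} for row in rows]

print(grouped)
print(windowed)
```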
When the .index is convenient
- 01 May 2024
The blazingly fast DataFrame library, Polars, has a huge conceptual difference from the DataFrame veteran, pandas: pandas is ALL about working with a consistent index, whereas Polars forces individuals to work more explicitly using joins.
I came across a question on Stack Overflow that provided a great example of the benefits of working in an index-aligned way.
DataFrame Joins & MultiSets
- 24 April 2024
There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this post, I want to clarify this relationship (and show you some Python and pandas code along the way).
Last week, I covered unique equality joins, which describe the simplest scenario in which sets and table join logic completely overlap. This parallels the idea that table joins can be represented with Venn diagrams. This week, I want to show where this mode of thinking tends to fall flat.
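One way to see where the Venn-diagram picture breaks down is to let keys repeat. The sketch below uses made-up key lists and `collections.Counter` as a stand-in for a multiset. It is only an illustration of the mismatch, not code from the post.

```python
from collections import Counter

# Hypothetical join keys with duplicates on both sides.
left_keys = ["a", "a", "b"]
right_keys = ["a", "a", "a", "c"]

# An equality join pairs every matching left row with every matching right row.
joined = [(left, right) for left in left_keys for right in right_keys if left == right]

# A multiset intersection keeps the minimum count of each shared key instead.
multiset_intersection = Counter(left_keys) & Counter(right_keys)

print(len(joined))
print(multiset_intersection["a"])
```

The join yields 2 × 3 = 6 rows for key "a", while the multiset intersection keeps only 2 copies, so the join behaves like a per-key product and not like an intersection.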
DataFrame Joins & Sets
- 17 April 2024
There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).
Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).
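When keys are unique on both sides, the correspondence with set operations can be sketched directly in plain Python. The two dictionaries and the `join` helper below are hypothetical; each dictionary plays the role of a table with a unique key column and a single value column.

```python
# Hypothetical tables: unique key -> value.
left = {"a": 1, "b": 2, "c": 3}
right = {"b": 20, "c": 30, "d": 40}

# Each join type selects its keys with a set operation.
inner_keys = left.keys() & right.keys()  # intersection
outer_keys = left.keys() | right.keys()  # union
anti_keys = left.keys() - right.keys()   # difference (rows only in left)


def join(keys):
    """Build joined rows for the chosen keys; missing sides become None."""
    return [(key, left.get(key), right.get(key)) for key in sorted(keys)]


inner = join(inner_keys)
outer = join(outer_keys)
print(inner)
print(outer)
```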
Parsing Unconventional Text
- 10 April 2024
Hey everyone! I’m back to playing around with Polars again and wanted to share a fun problem I came across on Stack Overflow. In this problem, the OP had some raw textual data in a key-value paired format. However, this format is not one that is commonly supported, like JSON. This means we get to write a custom parser!
We need to read in this data and create a column for each of these fields, filling in null values for any row that is missing a field defined in an earlier or later row.
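The excerpt does not show the actual input format, so the sketch below assumes a hypothetical one: one record per line, with `key=value` fields separated by semicolons. The `raw` sample and the `parse_records` helper are illustrative only. The point is the two-pass idea: parse each row first, then collect every field seen anywhere and fill the gaps with `None`.

```python
# Hypothetical raw input; the real format from the question is not shown here.
raw = """\
name=sensor-1; age=36
name=sensor-2; city=Oslo
age=41; city=Lima; name=sensor-3
"""


def parse_records(text):
    """Parse key=value lines into rows that all share the same columns."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pairs = (field.split("=", 1) for field in line.split(";"))
        records.append({key.strip(): value.strip() for key, value in pairs})

    # Collect every field name in first-seen order across all rows.
    columns = list(dict.fromkeys(key for record in records for key in record))

    # Rebuild each row with all columns, using None where a field is missing.
    return [{column: record.get(column) for column in columns} for record in records]


table = parse_records(raw)
for row in table:
    print(row)
```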
Intentional Visualizations
- 27 March 2024
Hello, everyone! This week, I want to discuss the often-overlooked exploratory charts.
I often speak to a dichotomy of purposes whenever I discuss data visualization. These purposes are designed to help organize our thoughts about both why and how we should visualize our data in the first place. The reasons one might reach for a visualization are:
Timing DataFrame Filters
- 20 March 2024
Hello, everyone! I wanted to follow up on last week’s blog post, Profiling pandas Filters, and test how Polars stands up in its simple filtering operations.
An important note: these timings are NOT intended to be exhaustive and should not be used to determine if one tool is “better” than another.
Polars Expressions on Nested Data
- 07 February 2024
Welcome back to Cameron’s Corner! This week, I wanted to share another interesting question I came across on Stack Overflow: “How to add numeric value from one column to other List colum elements in Polars?”.
Speaking of Polars, make sure you sign up for our upcoming µTraining, “Blazing Fast Analyses with Polars.” This µTraining consists of live discussion, hands-on problem solving, and code review with our instructors. You won’t want to miss it!
Polars: Groupby and idxmin
- 17 January 2024
Welcome back to Cameron’s Corner! It’s the third week of January, and, instead of talking about graphs, I want to take a dive into Polars. I recently addressed a question on Polars’ Discord server, diving into the different ways to perform an “index minimum” operation across groups.
Sure, there’s a built-in Expression.idx_min(), but it operates a little differently than it does in pandas. Let’s take a look:
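Before comparing libraries, it helps to pin down what "index minimum within a group" can mean. The plain-Python sketch below uses made-up `groups` and `values` columns and computes two candidate answers: the position of the minimum in the whole table, and the position of the minimum inside its own group. Which of these a given library returns is not asserted here; the sketch only shows that the two can differ.

```python
from collections import defaultdict

# Hypothetical columns of a small table.
groups = ["a", "a", "b", "b", "b"]
values = [3, 1, 5, 2, 4]

# Record which table positions belong to each group.
positions = defaultdict(list)
for position, group in enumerate(groups):
    positions[group].append(position)

# Position of each group's minimum, counted from the start of the table.
table_idx_min = {
    group: min(members, key=values.__getitem__)
    for group, members in positions.items()
}

# Position of each group's minimum, counted from the start of the group.
group_idx_min = {
    group: members.index(table_idx_min[group])
    for group, members in positions.items()
}

print(table_idx_min)
print(group_idx_min)
```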
Make Your Naive Code Fast with Polars
- 19 April 2023
Welcome back to Cameron’s Corner! This week, I presented a seminar on the conceptual comparison between two of the leading DataFrame libraries in the Python Open Source ecosystem: the veteran pandas vs the newest library on the block, Polars.
Polars has been around for over a year now, and since its first release, it has gained a lot of traction. But what is all of the hype about? Is it some “faster-than-pandas” benchmark? The expression API? Or something else entirely? I’m still going to be using pandas, but in my opinion, Polars does indeed live up to its hype.