Posts tagged python

Plotting Timeseries Data in Matplotlib

Plotting time series data in Matplotlib can be a bit tricky, especially when it comes to making your tick labels look clean and readable. Customizing tick locators and formatters helps you get those labels just right, whether you’re dealing with dates or times. In this post, we’ll cover how to use Matplotlib’s Locator and Formatter classes to tweak your time-based ticks. From handling different date ranges to formatting labels in a way that makes sense for your data, we’ll walk through some useful tricks. Let’s jump in and make those time series plots a little easier to read!

In Matplotlib, Locators and Formatters work together to control how the ticks (the markers along the axes) are positioned and labeled in a plot.
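As a minimal sketch of that division of labor (the data below is made up, and the monthly tick spacing is just one reasonable choice), a Locator picks the tick positions and a Formatter turns each position into a label:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Hypothetical daily data covering roughly six months
start = dt.date(2024, 1, 1)
days = [start + dt.timedelta(days=i) for i in range(180)]
values = [i % 30 for i in range(180)]

fig, ax = plt.subplots()
ax.plot(days, values)

# The Locator decides WHERE ticks go: here, the first day of each month
ax.xaxis.set_major_locator(mdates.MonthLocator())
# The Formatter decides HOW each tick is labeled: here, the abbreviated month name
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))

# Render once so the tick labels are populated
fig.canvas.draw()
labels = [tick.get_text() for tick in ax.get_xticklabels()]
```

Swapping in a different Locator (for example `mdates.WeekdayLocator`) changes the positions without touching the label format, and vice versa.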

Read more ...


Polars Has Inequality Joins!

It finally happened: Polars supports inequality joins, at least if you are using version 1.7.0 or later. This week, I want to revisit a familiar problem from a recent blog post of mine that covered a few different types of joins and when they are useful. I will tackle the same problem in Polars and discuss how an inequality join can simplify our lives and make our compute workloads more manageable.

In many data analysis scenarios, simple equality joins (matching rows where column values are exactly the same) aren’t enough. A common use case involves working with time-based data, such as tracking state changes of devices and logging alert events for those devices. Here, you don’t just want to match records by a shared device_id; you need to match based on time intervals, like determining which alerts occurred during a specific state. These kinds of joins, known as inequality joins, are essential in areas like time-series analysis, event-driven systems, and continuous monitoring, where relationships between data points depend on conditions beyond simple equality (e.g., a timestamp falling within a specific range).

Read more ...


DataFrames: one-hot encoding and back

Welcome to this week’s edition of Cameron’s Corner! This week, I want to take a quick dive into one-hot encoding and how we can use it within our popular tabular data tools in Python.

For some background, one-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format where each category is represented as a unique vector of 0s and 1s. It is commonly used when dealing with categorical variables in algorithms that require numerical input, such as neural networks and decision trees. Each category is transformed into a binary vector with one element set to 1 and all others set to 0, allowing models to interpret the data without assigning implicit ordinal relationships. One-hot encoding is especially useful in cases like text classification, image recognition, and natural language processing tasks.
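To make the description concrete, here is a small sketch in pandas, assuming a made-up `color` column. It encodes with `pandas.get_dummies` and decodes with `pandas.from_dummies` (available in pandas 1.5 and later):

```python
import pandas as pd

# Hypothetical categorical column
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Encode: one 0/1 column per category, exactly one 1 per row
encoded = pd.get_dummies(colors, dtype=int)

# Decode: recover each label from the position of the 1 in its row
decoded = pd.from_dummies(encoded).iloc[:, 0]
```

Each row of `encoded` contains a single 1, which is what lets `from_dummies` reverse the transformation without ambiguity.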

Read more ...


Plotting without Weekends

Welcome back to Cameron’s Corner! This week, I wanted to touch on a timeseries-oriented data visualization question that I came across: “How do I plot hourly data, but omit the weekends?”

On the surface, this sounds like a simple data-filtering question. However, we also need to consider the visual elements that go into visualizing these data. Take a peek down at the Original Visualization section to see an example of the chart that generated this question.

Read more ...


“Broadcasting” in Polars

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to look into performance in Polars when working with multiple DataFrames. We are going to cover “broadcasting”—a term used for aligning NumPy arrays against one another—in Polars. Polars always encourages us to be explicit when aligning and working with multiple DataFrames, but there are a few different conceptual approaches we can take to arrive at the same solution. This is where today’s micro-benchmark will come in. We want to find the fastest approach for this specific problem.
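For readers who have not met the term, this is roughly what broadcasting looks like in NumPy itself. It is a tiny sketch with made-up values, not the Polars benchmark from the post:

```python
import numpy as np

column = np.array([1.0, 2.0, 3.0])  # array of length 3
single = np.array([10.0])           # array of length 1

# NumPy stretches the length-1 array to match the longer one before multiplying
result = column * single
```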

Say we have two LazyFrames, where one is of an undetermined length and the other has a known length of 1 (after collection). We want to be able to perform arithmetic across these two LazyFrames as in the following example:

Read more ...


pandas: Months, Days, and Categoricals

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss how we can leverage pandas.Categorical arrays when working with calendar months and weekdays. This is a bit of a longstanding issue I’ve had with pandas. However, I am not the only one who has thought of this, so I have to respect the priorities of the core developers who contribute their time to this project.

We have some dates stored in a pandas.Series, and, at some point during our analytical pipeline, we need to work with individual months and/or weekdays. We can readily extract the integer value that corresponds with each month (where January ⇒ 1, December ⇒ 12), OR we can extract the string name of the month. The same transformations are available for the day of the week (where Monday ⇒ 0, Sunday ⇒ 6).
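A quick sketch of those extractions, using made-up dates and assuming an English locale for the month names. The last step shows one way an ordered Categorical can help: the names then sort in calendar order instead of alphabetically.

```python
import calendar
import pandas as pd

# Hypothetical dates
dates = pd.Series(pd.to_datetime(["2024-03-15", "2024-01-02", "2024-12-25"]))

month_numbers = dates.dt.month        # January => 1, December => 12
month_names = dates.dt.month_name()   # string names such as "March"
weekday_numbers = dates.dt.dayofweek  # Monday => 0, Sunday => 6

# Plain strings sort alphabetically; an ordered Categorical sorts by calendar position
ordered_months = pd.Categorical(
    month_names, categories=calendar.month_name[1:], ordered=True
)
calendar_sorted = pd.Series(ordered_months).sort_values().tolist()
```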

Read more ...


Case When: A Welcome Addition to pandas Conveniences

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a relatively new pandas.Series method: case_when. This method exists to conveniently make replacements across multiple conditions, but instead of describing what it does, I’d rather show you. Let’s jump into our premise.

Suppose you are a teacher grading papers with six students. They all have different grades, which are the following:

Read more ...


pandas: Within the Restricted Computation Domain

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.

Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.

Read more ...


pandas groupby Along the Columns

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss a deprecation in the pandas API. Unfortunately, the axis=… argument in pandas.DataFrame.groupby is being deprecated. While it is official, there has been some disagreement within the community on this newest change, primarily because of the conveniences it offers.

But, what is the axis parameter, and what workarounds do we have? Let’s take a look:
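One possible workaround, sketched here with a made-up table, is to transpose so the columns become rows, group those rows, and transpose back. This is only an illustration and may not be the approach the full post settles on:

```python
import pandas as pd

# Hypothetical wide table: quarterly columns that we want to total per year
df = pd.DataFrame(
    {"2023_Q1": [1, 2], "2023_Q2": [3, 4], "2024_Q1": [5, 6], "2024_Q2": [7, 8]},
    index=["a", "b"],
)

# Derive a group label (the year prefix) for every column
year = df.columns.str.split("_").str[0]

# Transpose so column labels become the row index, group, then transpose back
per_year = df.T.groupby(year).sum().T
```

The double transpose copies data, so on very wide or mixed-dtype frames it is worth checking both memory use and the resulting dtypes.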

Read more ...


pandas.concat, explained.

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tackle a pandas question I received concerning the different ways to combine pandas.DataFrames. Today, I’ll focus on pandas.concat, since we have covered DataFrame merges quite thoroughly in previous weeks (see DataFrame Inequality Joins, DataFrame Joins & Sets, and DataFrame Joins & MultiSets).

In pandas, we have three explicit ways to combine DataFrames:

Read more ...


DataFrame Inequality Joins

Hello, and welcome back to Cameron’s Corner! This week, I want to follow up on two blog posts from a couple of months back that discussed DataFrame Joins & Sets and DataFrame Joins & MultiSets.

Instead of speaking more about equality joins, I want to talk about inequality joins. These are a special kind of table join that handles cases where keys don’t match up exactly, particularly when working with continuous (non-categorical) data.
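A rough sketch of the idea in pandas, using made-up device states and alerts. The approach here is the simple one: an equality join followed by a filter. That is easy to read but builds every pairing per device first, so treat it as an illustration and not a performance recommendation.

```python
import pandas as pd

# Hypothetical device states, each valid over the half-open interval [start, end)
states = pd.DataFrame({
    "device_id": [1, 1, 2],
    "state": ["idle", "busy", "idle"],
    "start": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 06:00", "2024-01-01 00:00"]),
    "end": pd.to_datetime(["2024-01-01 06:00", "2024-01-01 12:00", "2024-01-01 12:00"]),
})

# Hypothetical alert events
alerts = pd.DataFrame({
    "device_id": [1, 1, 2],
    "alert_time": pd.to_datetime(["2024-01-01 03:00", "2024-01-01 07:30", "2024-01-01 09:00"]),
})

# Step 1: an equality join on device_id pairs every alert with every state of that device
pairs = alerts.merge(states, on="device_id")

# Step 2: the inequality condition keeps pairs where the alert falls inside the interval
in_interval = (pairs["alert_time"] >= pairs["start"]) & (pairs["alert_time"] < pairs["end"])
matched = pairs.loc[in_interval, ["device_id", "alert_time", "state"]].reset_index(drop=True)
```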

Read more ...


Flexibility & Ergonomics

Hi all, welcome back to Cameron’s Corner! This week, I want to talk about flexibility and ergonomics.

Oftentimes, we want to write code that is flexible enough to adapt to the ever-changing problems we are presented with. This often means that we have to write code that anticipates different formulations of an existing business problem. On the other hand, we should also endeavor to write code that is readily usable by our colleagues or other end-users. While these forces, flexibility and ergonomics, may feel like they pull in opposite directions, we should always strive to find a solution where these ideas work in tandem. The most generalized approach we can take to satisfy this is to design APIs with two primary layers of abstraction:

Read more ...


A FlagEnum Categorical in pandas

Hi all, welcome back to Cameron’s Corner! This week, I want to explore the encoding of combinatoric sets (from a limited pool) inside a pandas.DataFrame. In more colloquial terms, I want to explore the following example:

We have a catalog of programming articles & videos (entities).

Read more ...


Tabular Group By Sets

Hi all, welcome back to Cameron’s Corner! This week, I want to replicate some convenient analytical functionality from DuckDB in both pandas and Polars.

Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

Read more ...


pandas & Polars: Window Functions vs Group By

Welcome to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

This week, I want to dive back into “window” and “group by” operations. This time, instead of focusing on the SQL syntax, we’ll cover my two favorite DataFrame libraries, pandas and Polars, to discuss the differences in their APIs.

Read more ...


Faster strftime

Welcome back to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

On to the topic at hand. I wanted to tackle a fun pandas optimization problem, focusing on converting datetime objects to their date counterparts. I took this problem “head on,” meaning I did not ask why the end user wanted this output; I just benchmarked their existing approaches and threw in a couple of my own.

Read more ...


Decorators: Registration Pattern

Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

Okay, on to this week’s post!

Read more ...


Working With Files Deep in Your Code

Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics for experts. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

As you may already know, we frequently train corporate teams on topics such as introduction to Python, advanced Python, API design, data analysis, and much more! Our trainings always involve custom curriculum which we tailor to the needs of the team and balance with the expectations of management.

Read more ...


Tables: Window Functions vs Group By

Hello, everyone! This week, I want to dive into “window” and “group by.” What’s the difference? When should you use one over the other? Let’s take a look.

Both window and group by functions are used to perform operations across a subset of rows of a table. These rows are subsetted based on a unique grouping of values within a column.
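In pandas terms, the contrast can be sketched like this (the table is made up): a group by collapses each group to one row, while a window-style operation computes the same statistic but keeps one output value per input row.

```python
import pandas as pd

# Hypothetical table
df = pd.DataFrame({"team": ["a", "a", "b", "b", "b"], "score": [1, 3, 2, 4, 6]})

# Group by: one output row per unique value of "team"
grouped = df.groupby("team")["score"].mean()

# Window: same statistic, broadcast back so every original row is preserved
windowed = df.assign(team_mean=df.groupby("team")["score"].transform("mean"))
```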

Read more ...


When the .index is convenient

The blazingly-fast DataFrame library, Polars, has a huge conceptual difference from the DataFrame veteran, pandas: pandas is ALL about working with a consistent index, whereas Polars forces individuals to work more explicitly using joins.

I came across a question on Stack Overflow that provided a great example of the benefits of working in an index-aligned way.
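To illustrate what index alignment means here (this is a made-up example, not the Stack Overflow question itself), pandas matches values by label before doing arithmetic:

```python
import pandas as pd

# Hypothetical Series with overlapping but not identical labels
prices = pd.Series({"apple": 1.0, "banana": 0.5, "cherry": 3.0})
quantities = pd.Series({"cherry": 2, "apple": 10})  # different order, one label missing

# Values are matched on labels, not positions; labels without a partner become NaN
totals = prices * quantities
```

Getting the same result without an index would typically mean writing an explicit join on the label column first.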

Read more ...


DataFrame Joins & MultiSets

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Last week, I covered unique equality joins, which describe the simplest scenario: one in which sets and table join logic completely overlap. This parallels the idea that table joins can be represented with Venn diagrams. This week, I want to show where this mode of thinking tends to fall flat.

Read more ...


DataFrame Joins & Sets

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).
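A small sketch of that correspondence, using made-up tables whose keys are unique on both sides: the keys surviving an inner join are the set intersection, and the keys of an outer join are the set union.

```python
import pandas as pd

# Hypothetical tables; each key appears at most once per table
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = left.merge(right, on="key", how="inner")
outer = left.merge(right, on="key", how="outer")

# With unique keys, the joined keys line up with plain set operations
assert set(inner["key"]) == set(left["key"]) & set(right["key"])
assert set(outer["key"]) == set(left["key"]) | set(right["key"])
```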

Read more ...


Parsing Unconventional Text

Hey everyone! I’m back to playing around with Polars again and wanted to share a fun problem I came across on Stack Overflow. In this problem, the OP had some raw textual data in a key-value paired format. However, this format is not one that is commonly supported, like JSON. This means we get to write a custom parser!

We need to read in this data and create a column for each of these fields, appropriately filling in null values for any row that is missing a field that is previously or later defined.

Read more ...


Intentional Visualizations

Hello, everyone! This week, I want to discuss the often-overlooked exploratory charts.

I often speak to a dichotomy of purposes whenever I discuss data visualization. These purposes are designed to help organize our thoughts about both why and how we should visualize our data in the first place. The reasons one might reach for a visualization are:

Read more ...


Timing DataFrame Filters

Hello, everyone! I wanted to follow up on last week’s blog post, Profiling pandas Filters, and test how Polars stands up in its simple filtering operations.

An important note: these timings are NOT intended to be exhaustive and should not be used to determine if one tool is “better” than another.

Read more ...


Profiling pandas Filters

Hello, everyone! For Cameron’s Corner this week, I wanted to spend some time differentiating between various filtering operations in pandas. Specifically, I wanted to test out operations on a DatetimeIndex for working with slices of datetime values.

Let’s do some quick timings for each of these approaches. I’ve ordered them by what my intuition tells me will be slowest to fastest:

Read more ...


Python Set vs Pandas.Index

For the past few weeks, I have been meeting with some fantastic clients in one-on-one sessions to cover the core Python and pandas skills needed to perform rapid data analysis. We have discussed a variety of topics, but this week has been one of my favorites because we are doing a deep dive into pandas. Of course, the framing for pandas is all about the Index, so I decided to keep it light and ensure we tie it back to some core Python concepts.

When discussing the Index in pandas, I always find it useful to contrast it against a Python built-in that exhibits some similar behaviors: the set. This week, I want to focus on each of these data structures to understand where they overlap, their differences, and the lessons they can teach us.
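A brief sketch of the comparison, with made-up values: both structures support set-style operations, but an Index is also ordered, positionally addressable, and allowed to hold duplicates.

```python
import pandas as pd

a_set, b_set = {1, 2, 3}, {2, 3, 4}
a_idx, b_idx = pd.Index([1, 2, 3]), pd.Index([2, 3, 4])

# Overlap: both support intersection and union
set_common = a_set & b_set
idx_common = a_idx.intersection(b_idx)
idx_union = a_idx.union(b_idx)

# Differences: an Index has positions, and it may contain repeated labels
first_label = a_idx[0]
dup_idx = pd.Index([1, 1, 2])
```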

Read more ...


United States President’s Age

Welcome to Cameron’s Corner! This week, I want to recreate a chart from a post on r/dataisbeautiful by u/graphguy.

Read more ...


Polars Expressions on Nested Data

Welcome back to Cameron’s Corner! This week, I wanted to share another interesting question I came across on Stack Overflow: “How to add numeric value from one column to other List colum elements in Polars?”.

Speaking of Polars, make sure you sign up for our upcoming µtraining, “Blazing Fast Analyses with Polars.” This µtraining consists of live discussion, hands-on problem solving, and code review with our instructors. You won’t want to miss it!

Read more ...


Tiered Bar Chart in Matplotlib

Welcome back to Cameron’s Corner! This week, I wanted to share an answer I posted on Stack Overflow to a question entitled Create a bar chart in Python grouping the x-axis by two variables. This question sought to create a grouped bar chart, but also have hierarchical x-tick labels.

The question effectively asked how to create a chart like this:

Read more ...


Good pandas means good Python

Welcome back to Cameron’s Corner! This week, I want to talk about the intersection of Python and pandas. I often hear from other teachers that it is easiest to teach skills that will help students get “up and running.” Unfortunately, this often translates to “let’s teach the pandas API.” This leads to many roadblocks down the line caused by an extremely superficial understanding of how to think about pandas operations or how to best leverage Python to lean into your pandas tasks.

So, let’s take a look at a data-cleaning example, where, while possible, working through pandas will be clumsy.

Read more ...


Polars: Groupby and idxmin

Welcome back to Cameron’s Corner! It’s the third week of January, and, instead of talking about graphs, I want to take a dive into Polars. I recently addressed a question on Polars’ Discord server, diving into the different ways to perform an “index minimum” operation across groups.

Sure, there’s a built-in Expression.idx_min(), but it operates a little differently than it does in pandas. Let’s take a look:

Read more ...


Counting paths in pandas & networkx

Welcome back to Cameron’s Corner! It’s the second week of January, and I’m already here to talk about graphs. No, not the kind we make in Matplotlib, but network graphs! This blog post was inspired by a project I’ve been working on: counting the number of indirect connections between two non-adjacent nodes in a bipartite graph.

In graph theory terms, a graph is bipartite if its nodes can be split into two disjoint sets, where every edge connects a node in one set to a node in the other and never two nodes within the same set. Here is an example from Wikipedia of what a complete bipartite graph might look like:

Read more ...


Don’t Use This Code’s top 10 resolutions of 2024 for YOU!

Hello everyone and welcome to the first Cameron’s Corner of the New Year! Before we get too far, I wanted to just do a quick recap of our year.

In 2023, Don’t Use This Code…

Read more ...


Visualizing Temperature Deviations

This week, I wanted to do some data manipulation in Polars and recreate a data visualization I came across a while ago from the Python Graph Gallery, titled “Area Chart Over Flexible Baseline.” I liked this type of chart because it highlights an aggregate measure of interest that is easy to understand and demonstrates how much that measure deviates from some context. In this case, the chart communicates how much the temperature across a given year in a specific city has deviated with respect to historical aggregations.

Most free historical weather data APIs that I have encountered consume latitude and longitude coordinates instead of addresses. However, for the code here, I am going to use an address API to look up the location of a given city/state. We can then feed the response from this API into the weather API. This makes it trivial to query different locations across the world!

Read more ...


DataFrame Value Membership Testing

This week, I received a great question on our Discord Server about finding strings within a list in a pandas.Series.

But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.

Read more ...


Playing Scrabble with Xarray

Welcome to Cameron’s Corner! In my last blog post, I explored how to use index-alignment to solve some simple Scrabble problems. Today I want to do the same using Xarray!

But, before I get started, I want to invite you to our upcoming µtraining (“micro-training”) that we will be hosting on December 19th and 21st. This unique training format ensures direct interaction with instructors and your peers, providing practical insights and immediate problem-solving guidance.

Read more ...


Playing Scrabble Faster

Welcome to Cameron’s Corner! This morning, I gave a seminar on coding word games like an expert! I talked about prototyping the game of Scrabble, and wanted to share some additional thoughts I had after the presentation.

But, before I get started, I want to invite you to our next (and final!) seminar in our Python: How the Experts Do It series, “Battleship: An Expert’s Approach to Seemingly Simple Games.” Join us as we embark on the Battleship journey, leveraging Python’s object-oriented prowess to design and implement this iconic game.

Read more ...


Playing (more) Tic-Tac-Toe

Hello everyone and welcome back! Last week, we discussed my live-coded approach (and improvements!) to the game of Tic-Tac-Toe. This week, I wanted to see how flexible my approach is going to be.

But, before we get into it, make sure you register for our next expert lab, “Word Games: An Expert’s Approach to Seemingly Simple Games.” During this session, we’ll unravel the mysteries of word unscrambling in Jumble and challenge ourselves with the strategic wordplay of Scrabble. You’ll witness firsthand how Python’s powerful string manipulation features and other data structures can simplify coding of these games.

Read more ...


Playing Tic-Tac-Toe

Hello, everyone! This week, I held a seminar where I live-coded the game of tic-tac-toe based on some constraints from a client. I wanted to share with you what the final version of this code would look like after a round of review.

Before we get started, I want to tell you about my upcoming seminar with a similar theme, “A Python Expert’s Approach to Rock, Paper, Scissors.” During this seminar, we’ll dissect the game’s rules, design custom Python functions, and explore the strategic thinking behind this simple yet captivating game. We’ll start with the basics, modeling the game using core Python data structures, and then quickly progress to incorporate more advanced features.

Read more ...


When do I Write a Function?

Hey all, this week I wanted to visit a topic that comes up across many of the courses that we teach:

When do I write a function?

Read more ...