Posts tagged pandas

DataFrames: one-hot encoding and back

Welcome to this week’s edition of Cameron’s Corner! This week, I want to take a quick dive into one-hot encoding and how we can use it within our popular tabular data tools in Python.

For some background, one-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary format where each category is represented as a unique vector of 0s and 1s. It is commonly used when dealing with categorical variables in algorithms that require numerical input, such as neural networks and decision trees. Each category is transformed into a binary vector with one element set to 1 and all others set to 0, allowing models to interpret the data without assigning implicit ordinal relationships. One-hot encoding is especially useful in cases like text classification, image recognition, and natural language processing tasks.

Read more ...

Plotting without Weekends

Welcome back to Cameron’s Corner! This week, I wanted to touch on a timeseries-oriented data visualization question that I came across: “How do I plot hourly data, but omit the weekends?”

On the surface, this sounds like a simple data-filtering question. However, we also need to consider the visual elements that go into visualizing these data. Take a peek down at the Original Visualization section to see an example of the chart that generated this question.

Read more ...

“Broadcasting” in Polars

11 September 2024
Cameron Riddell
python pandas

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to look into performance in Polars when working with multiple DataFrames. We are going to cover “broadcasting”—a term used for aligning NumPy arrays against one another—in Polars. Polars always encourages us to be explicit when aligning and working with multiple DataFrames, but there are a few different conceptual approaches we can take to arrive at the same solution. This is where today’s micro-benchmark will come in. We want to find the fastest approach for this specific problem.

Say we have two LazyFrames, where one is of an undetermined length and the other has a known length of 1 (after collection). We want to be able to perform arithmetic across these two LazyFrames as in the following example:

Read more ...

pandas: Months, Days, and Categoricals

28 August 2024
Cameron Riddell
python pandas

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss how we can leverage pandas.Categorical arrays when working with calendar months and weekdays. This is a bit of a longstanding issue I’ve had with pandas. However, I am not the only one who has thought of this, so I have to respect the priorities of the core developers who contribute their time to this project.

We have some dates stored in a pandas.Series, and, at some point during our analytical pipeline, we need to work with individual months and/or weekdays. We can readily extract the integer value that corresponds with each month (where January ⇒ 1, December ⇒ 12), OR we can extract the string name of the month. The same transformations are available for the day of the week(where Monday ⇒ 0, Sunday ⇒ 6).

Read more ...

Case When: A Welcome Addition to pandas Conveniences

21 August 2024
Cameron Riddell
python pandas

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share about a relatively new pandas.Series method: case_when. This function exists to conveniently make replacements across multiple conditions, but instead of describing what it does, I’d rather show you. Let’s jump into our premise.

Suppose you are a teacher grading papers with six students. They all have different grades, which are the following:

Read more ...

pandas: Within the Restricted Computation Domain

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to share a fascinating discussion I recently had about the Restricted Computation Domain in pandas. Well, it was actually about outlier detection within groups on a pandas DataFrame, but our conversation quickly turned to other topics.

Let’s take a look at the question and the code that kicked everything off. The original question focused on translating the following block of pandas code into Polars.

Read more ...

pandas groupby Along the Columns

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to discuss a deprecation in the pandas API. Unfortunately, the axis=… argument in pandas.DataFrame.groupby is being deprecated. While it is official, there has been some disagreement within the community on this newest change, primarily because of the conveniences it offers.

But, what is the axis parameter, and what workarounds do we have? Let’s take a look:

Read more ...

pandas.concat, explained.

Hello, everyone! Welcome back to Cameron’s Corner. This week, I want to tackle a pandas question I received concerning the different ways to combine pandas.DataFrames. Today, I’ll focus on pandas.concat, since we have covered DataFrame merges quite thoroughly in previous weeks. Specifically, we’ll take a look at DataFrame Inequality Joins, DataFrame Joins & Sets and DataFrame Joins & MultiSets.

In pandas, we have three explicit ways to combine DataFrames:

Read more ...

DataFrame Inequality Joins

Hello, and welcome back to Cameron’s Corner! This week, I want to follow up on two blog posts from a couple months back that discussed DataFrame Joins & Sets and DataFrame Joins & MultiSets.

Instead of speaking more about equality joins, I want to talk about inequality joins. These are a special table join operation that handles conditions when keys don’t match up perfectly, particularly when working with continuous (non-categorical) data.

Read more ...

Flexibility & Ergonomics

Hi all, welcome back to Cameron’s Corner! This week, I want to talk about flexibility and ergonomics.

Oftentimes, we want to write code that is flexible to adapt to the ever-changing problems we are presented with. This often means that we have to write code that anticipates different formulations of an existing business problem. On the other hand, we should also endeavor to write code that is readily usable by our colleagues or other end-users. While these forces—flexibility and ergonomics—may feel like they pull in opposite directions, we should always strive to find a solution where these ideas work in tandem. The most generalized approach we can take to satisfy this is to design APIs with two primary layers of abstraction:

Read more ...

A FlagEnum Categorical in pandas

Hi all, welcome back to Cameron’s Corner! This week, I want to explore the encoding of combinatoric sets (from a limited pool) inside a pandas.DataFrame. In more colloquial terms, I want to explore the following example:

We have a catalog of programming articles & videos (entities).

Read more ...

Tabular Group By Sets

Hi all, welcome back to Cameron’s Corner! This week, I want to replicate some convenient analytical functionality from DuckDB in both pandas and Polars.

Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

Read more ...

pandas & Polars: Window Functions vs Group By

Welcome to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series covering (even more) Python basics that any aspiring Python expert needs to know in order to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

This week, I want to dive back into “window” and “group by” operations. This time, instead of focusing on the SQL syntax, we’ll cover my two favorite DataFrame libraries, pandas and Polars, to discuss the differences in their APIs.

Read more ...

Faster strftime

Welcome back to this week’s Cameron’s Corner! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

On to the topic at hand. I wanted to tackle a fun pandas optimization problem, focusing on converting datetime objects to their date counterparts. For this problem, I did take it “head on,” meaning I did not inquire why the end user wanted this output, just performed some benchmarking on their existing approaches and threw in a couple of my own.

Read more ...

Decorators: Registration Pattern

Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics that experts need to make their code more effective and efficient. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

Okay, on to this week’s post!

Read more ...

Working With Files Deep in Your Code

Hello, everyone! Before we get started, I want to let you know about our upcoming public seminar series, “(Even More) Python Basics for Experts.” Join James in this three-session series about (even more) Python basics for experts. He’ll tackle what’s real, how we can tell it’s real, and how we can do less work.

As you may already know, we frequently train corporate teams on topics such as introduction to Python, advanced Python, API design, data analysis, and much more! Our trainings always involve custom curriculum which we tailor to the needs of the team and balance with the expectations of management.

Read more ...

Tables: Window Functions vs Group By

Hello, everyone! This week, I want to dive into “window” and “group by.” What’s the difference? When should you use one over the other? Let’s take a look.

Both window and group by functions are used to perform operations across a subset of rows of a table. These rows are subsetted based on a unique grouping of values within a column.

Read more ...

When the .index is convenient

The blazingly-fast DataFrame library, Polars, has a huge conceptual difference from the DataFrame veteran, pandas: pandas is ALL about working with a consistent index, whereas Polars forces individuals to work more explicitly using joins.

I came across a question on Stack Overflow that provided a great example of the benefits of working in an index-aligned way.

Read more ...

DataFrame Joins & MultiSets

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Last week, I covered unique equality joins which describes the simplest scenario in which sets and table join logic completely overlap. This parallels the idea that table joins can be represented with Venn diagrams. This week, I want to show where this mode of thinking tends to fall flat.

Read more ...

DataFrame Joins & Sets

There is a fairly strong relationship between table joins and set theory. However, many of the table joins written in SQL, pandas, Polars and the like don’t translate neatly to set logic. In this blog post, I want to clarify this relationship (and show you some Python and pandas code along the way).

Let’s start with unique equality joins as they are the prototypical representation of a table-join operation. This is also the only type of join that neatly falls into standard set theory (without expanding to multi-sets, which we’ll discuss later).

Read more ...

Parsing Unconventional Text

Hey everyone! I’m back to playing around with Polars again and wanted to share a fun problem I came across on Stack Overflow. In this problem, the OP had some raw textual data in a key-value paired format. However, this format is not one that is commonly supported, like JSON. This means we get to write a custom parser!

We need to read in this data and create a column for each of these fields, appropriately filling in null values for any row that is missing a field that is previously or later defined.

Read more ...

Intentional Visualizations

Hello, everyone! This week, I want to discuss the often-overlooked exploratory charts.

I often speak to a dichotomy of purposes whenever I discuss data visualization. These purposes are designed to help organize our thoughts about both why and how we should visualize our data in the first place. The reasons one might reach for a visualization are:

Read more ...

Timing DataFrame Filters

Hello, everyone! I wanted to follow up on last week’s blog post, Profiling pandas Filters, and test how Polars stands up in its simple filtering operations.

An important note: these timings are NOT intended to be exhaustive and should not be used to determine if one tool is “better” than another.

Read more ...

Profiling pandas Filters

13 March 2024
Cameron Riddell
python pandas

Hello, everyone! For Cameron’s Corner this week, I wanted to spend some time differentiating between various filtering operations in pandas. Specifically, I wanted to test out operations on a DatetimeIndex for working with slices of datetime values.

Let’s do some quick timings for each of these approaches. I’ve ordered them by what my intuition tells me will be slowest to fastest:

Read more ...

Python Set vs Pandas.Index

For the past few weeks, I have been meeting with some fantastic clients in one-on-one sessions to cover the core Python and pandas skills needed to perform rapid data analysis. We have discussed a variety of topics, but this week has been one of my favorites because we are doing a deep dive into pandas. Of course, the framing for pandas is all about the Index, so I decided to keep it light and ensure we tie it back to some core Python concepts.

When discussing the Index in pandas, I always find it useful to contrast it against a Python built-in that exhibits some similar behaviors: the set. This week, I want to focus on each of these data structures to understand where they overlap, their differences, and the lessons they can teach us.

Read more ...

Good pandas means good Python

24 January 2024
Cameron Riddell
python pandas

Welcome back to Cameron’s Corner! This week, I want to talk about the intersection of Python and pandas. I often hear from other teachers that it is easiest to teach skills that will help students get “up and running.” Unfortunately, this often translates to “let’s teach the pandas API.” This leads to many roadblocks down the line caused by an extremely superficial understanding of how to think about pandas operations or how to best leverage Python to lean into your pandas tasks.

So, let’s take a look at a data-cleaning example, where, while possible, working through pandas will be clumsy.

Read more ...

Don’t Forget About the Index!

This week, we have another question from StackOverflow. The question this week features a pandas problem that looks tricky on the surface. However, it becomes quite straightforward once your remember to not forget about the .index.

Specifically, in this problem, we had a data manipulation problem:

Read more ...

Why is DataFrame.corr() so much slower than numpy.corrcoef?

28 June 2023
Cameron Riddell
pandas

Hey all! This week, I encountered a question that reminded me of our upcoming Performance seminar series.

I responded to this question on StackOverflow in which the author noted that calling pandas.DataFrame.corr() was much slower than calling numpy.corrcoef with the following result:

Read more ...

Make Your Naive Code Fast with Polars

19 April 2023
Cameron Riddell
polars pandas

Welcome back to Cameron’s Corner! This week, I presented a seminar on the conceptual comparison between two of the leading DataFrame libraries in the Python Open Source ecosystem: the veteran pandas vs the newest library on the block, Polars.

Polars has been around for over a year now, and since its first release, it has gained a lot of traction. But, what is all of the hype about? Is it some “faster-than-pandas” benchmark? The expression API? Or something else entirely? In my opinion, I’m still going to be using pandas, but Polars does indeed live up to its hype.

Read more ...

Tufte Weather In Matplotlib

Edward Tufte is one of the pioneers of modern-day data visualization. In his work, he is brilliantly able to distill core concepts that can then be applied to nearly any form of visual communication. If you aren’t familiar with his work and are interested in the topic of data visualization in general, I highly recommend Tufte’s book, “The Visual Display of Quantitative Information”.

Given the era in which they were created, almost all of Tufte’s original works are hand drawn, relying on pen and paper. As such, his work is as artistic as it is informational, providing unique visualizations that focus attention and convey meaningful messages.

Read more ...

What the Index?

Hello, world! My schedule is jam-packed this week getting ready for my upcoming seminar, “Spot the Lies Told by this Data,” but even that can’t take me away from Cameron’s Corner! This week, I want to discuss my old friend, the Index.

I’ve taught pandas to numerous colleagues and clients, and the most important lesson to learn when working with this tool is to always respect the Index.

Read more ...

Working With Slightly Messy Data

Hello, everyone! This week, I want to discuss working with real-world datasets. Specifically, how it’s common (and even expected) to encounter a number of data quality issues.

Some common questions you want to ask yourself when working with a new dataset are…

Read more ...

Dealing With Dates in pandas - Part 3

Welcome back, everyone!

In my previous post, we discussed how we can work effectively with datetimes in pandas, including how to parse datetimes, query our dataframe based on datetimes, and perform datetime-aware index alignment. This week, we’ll be exploring one final introductory feature for working with datetimes in pandas.

Read more ...

Dealing With Dates in Pandas - Part 2

In my previous post, we discussed how we can approach date times in pandas as well as the metaphors used by the library and the differences between absolute time and calendar time (also referred to as relative time).

This week, we’ll dive a little bit deeper into the functionality that pandas has to offer when dealing with time series data, covering topics like:

Read more ...

Dealing With Dates in Pandas - Part 1

So how do we work with dates and times in pandas? Well if we need to ensure our operations are as performant as possible we’ll need to reach into pandas restricted computation domain, and that means using its objects and playing by its rules.

Fortunately, the metaphors we’ve discussed about date times along the way still hold

Read more ...

pandas Groupby: split-?-combine

When choosing what groupby operations to run, pandas offers many options. Namely, you can choose to use one of these three:

agg or aggregate

Read more ...

Unconventional Pandas: Colormaps

Hello everyone! We have some exciting events coming up, including a NEW seminar series and a code review workshop series. In our brand new seminar series, we will share with you some of the hardest problems we have had to solve in pandas and NumPy (and, in our bonus session on September 16th, hard problems that we have had to solve in Matplotlib!). Then, next month starting October 12th, we will be holding our first ever “No Tears Code Review,” where we’ll take attendees througha a code review that will actually help them gain insight into their code and cause meaningful improvements to their approach.

Let’s get to the exciting content!

Read more ...

Karnaugh Maps In Pandas

As many of you know, I held a session on Logic this past month as part of “All the Computer Science You Never Took in College”. While I have never taken a computer science class in my life, I resonated with many of these concepts as things that I had encountered

A fun example that I presented used pandas to simplify a boolean algebra expression via a Karnaugh map. Karnaugh maps are useful tabular representations of boolean expressions that we can use to visually simplify this expression to a disjunctive form.

Read more ...

Pandas: SettingWithCopyWarning

Wrapping up June already?! I can’t believe how quickly things are moving.

I wanted to take some time today to discuss one of the most common issues facing pandas users: SettingWithCopyWarning

Read more ...

Pandas - What Else Can You .groupby?

Hey there! Welcome to the first DUTC newsletter of March 2022! We have had an action packed start to the year and are eager to keep the trainings coming. Next month, in March, we are unveiling a new lineup of weekly seminars titled: Confident Queries & Stronger SQL. Where we will help to not only refine your SQL skills, but also but also convey the underlying framework and mental models that power the most commonly used database querying language in the world. And if that isn’t enough to get excited about, then you should be excited for my next presentation where I’ll be comparing Pandas vs SQL to address the similarities and differences between these tools. What types of analyses are possible with either tool, how often do they overlap, and when they do- which one should I use? All of these questions and more will be answered this March! So make sure you register now for our SQL seminar series.

Not only do we have SQL sessions upcoming, but we also have an upcoming Developing Expertise in Python and pandas course this April 18-21! Our developing expertise courses are easily my favorite content we offer. The ability to sit down in a small group and address problems in a paired-programming environment provides the most impactful form of learning. Not only do you get to ask any question about syntax, concepts, and approaches- but you can do so in a safe environment while learning best practices within the PyData stack. If you want to bridge the gap from an intermediate Python programmer to become an expert Pythonista (RUN THIS TERMINOLOGY BY JAMES), then I can not recommend this course enough. We work tirelessly to create a balanced and custom curriculum to meet the goals of all of our attendees.

Read more ...