Mastering DuckDB when you're used to pandas or Polars
Published February 17, 2025
Marco Gorelli
You may have heard about DuckDB's impressive robustness and performance. Perhaps you want to try it out - BUT WAIT, you're a data scientist and are used to pandas and/or Polars, not SQL. You can use the SELECT, JOIN and GROUP BY commands, but not much more, and you may be wondering: is it even possible to use SQL to:
- Center a variable (i.e. subtract its mean)?
- Resample by time?
- Compute rolling statistics?
Not only are these all possible, but they're also easy. Let's learn how to implement dataframe fundamentals in SQL!
But first - why?
Why use DuckDB / SQL at all? Aren't dataframe APIs more readable and expressive anyway? Arguably, yes. Nonetheless, I think there are some very good reasons to implement a DuckDB SQL solution if you're able to:
- Stability: dataframe APIs tend to go through deprecation cycles to make API improvements. If you write a dataframe solution today, it's unlikely that it will still work 5 years from now. A SQL one, on the other hand, probably will.
- Portability: SQL standards exist, and although dialects differ in their implementations, migrating between SQL dialects is probably less painful than migrating between dataframe APIs.
- Ubiquity: analysts, engineers, and data scientists across industries are all likely to be familiar with SQL. They may not all rank it as their favourite language, but they can probably all read it, especially with the help of an LLM.
- Robustness: extensive SQL testing frameworks, such as sqllogictest, have already been developed, and so DuckDB can test against them to guard against buggy query results.
Furthermore, although classic SQL tends to have some annoying rules (such as "no comma after the last expression in SELECT!"), DuckDB has innovated on the syntax side with their Friendly SQL.
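For instance, here's a small sketch of two Friendly SQL conveniences - trailing commas and GROUP BY ALL. The inline VALUES data and column names (grp, amount) are made up purely for illustration:

import duckdb

duckdb.sql(
    """
    SELECT
        grp,
        SUM(amount) AS total,  -- trailing comma before FROM is fine in DuckDB
    FROM (VALUES ('a', 1), ('a', 2), ('b', 3)) df(grp, amount)
    GROUP BY ALL               -- no need to spell out the non-aggregated columns
    """
)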
Let's now look at translating common dataframe tasks into SQL.
Subtracting the mean
Subtracting the mean, also known as "centering", is a common data science technique performed before fitting classical regression models. In pandas or Polars, it's trivial:
data = {"a": [1, 3, -1, 8]}# pandasimport pandas as pddf_pd = pd.DataFrame(data)df_pd["a_centered"] = df_pd["a"] - df_pd["a"].mean()# Polarsimport polars as pldf_pl = pl.DataFrame(data)df_pl.with_columns(a_centered=pl.col("a") - pl.col("a").mean())
shape: (4, 2)
┌─────┬────────────┐
│ a   ┆ a_centered │
│ --- ┆ ---        │
│ i64 ┆ f64        │
╞═════╪════════════╡
│ 1   ┆ -1.75      │
│ 3   ┆ 0.25       │
│ -1  ┆ -3.75      │
│ 8   ┆ 5.25       │
└─────┴────────────┘
If you naively try translating to SQL, however, you'll get an error:
import duckdb

duckdb.sql(
    """
    SELECT *, a - MEAN(a) AS a_centered
    FROM df_pl
    """
)
BinderException: Binder Error: column "a" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(a)" if the exact value of "a" is not important.
SQL does not let us mix individual column values with aggregates like this. To get around it, we need to use a window function, which is a kind of function that produces a value for each row. We're taking the mean of column 'a' over the entire column, so we write:
duckdb.sql(
    """
    SELECT *, a - MEAN(a) OVER () AS a_centered
    FROM df_pl
    """
)
┌───────┬────────────┐
│   a   │ a_centered │
│ int64 │   double   │
├───────┼────────────┤
│     1 │      -1.75 │
│     3 │       0.25 │
│    -1 │      -3.75 │
│     8 │       5.25 │
└───────┴────────────┘
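As an aside, the empty OVER () means "over the whole table". If you wanted to center within groups instead, you could partition the window. Here's a minimal sketch using a hypothetical 'category' column that is not part of the original example:

import duckdb
import polars as pl

# hypothetical grouped data, just for illustration
df_groups = pl.DataFrame({"category": ["x", "x", "y", "y"], "a": [1, 3, -1, 8]})

duckdb.sql(
    """
    SELECT *, a - MEAN(a) OVER (PARTITION BY category) AS a_centered
    FROM df_groups
    """
)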
Resampling: weekly average
Say we have unevenly spaced temporal data, such as:
from datetime import datetime

dates = [
    datetime(2025, 1, 1),   # Wednesday
    datetime(2025, 1, 7),   # Tuesday
    datetime(2025, 1, 8),   # Wednesday
    datetime(2025, 1, 9),   # Thursday
    datetime(2025, 1, 16),  # Thursday
    datetime(2025, 1, 17),  # Friday
]
sales = [1, 5, 0, 4, 3, 6]
data = {"date": dates, "sales": sales}
We need to find the average weekly sales, where a week is defined as Wednesday to Tuesday. In pandas we'd use resample, in Polars group_by_dynamic:
# pandas
import pandas as pd

df_pd = pd.DataFrame(data)
df_pd.resample("1W-Wed", on="date", closed="left", label="left")["sales"].mean()

# Polars
import polars as pl

df_pl = pl.DataFrame(data)
(
    df_pl.group_by_dynamic(
        pl.col("date").alias("week_start"), every="1w", start_by="wednesday"
    ).agg(pl.col("sales").mean())
)
shape: (3, 2)
┌─────────────────────┬───────┐
│ date                ┆ sales │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ f64   │
╞═════════════════════╪═══════╡
│ 2025-01-01 00:00:00 ┆ 3.0   │
│ 2025-01-08 00:00:00 ┆ 2.0   │
│ 2025-01-15 00:00:00 ┆ 4.5   │
└─────────────────────┴───────┘
Replicating this in DuckDB is not rocket science, but it does involve a preprocessing step:
- We use DATE_TRUNC('week', date) to truncate each date to the Monday at the start of the Monday-Sunday week.
- To get our week to start on Wednesday, we need to first subtract 2 days, then truncate, and then add 2 days back: DATE_TRUNC('week', date - INTERVAL 2 DAYS) + INTERVAL 2 DAYS AS week_start.
In code:
import duckdb

duckdb.sql(
    """
    SELECT
        DATE_TRUNC('week', date - INTERVAL 2 DAYS) + INTERVAL 2 DAYS AS week_start,
        AVG(sales) AS sales
    FROM df_pl
    GROUP BY week_start
    ORDER BY week_start
    """
)
┌─────────────────────┬────────┐
│     week_start      │ sales  │
│      timestamp      │ double │
├─────────────────────┼────────┤
│ 2025-01-01 00:00:00 │    3.0 │
│ 2025-01-08 00:00:00 │    2.0 │
│ 2025-01-15 00:00:00 │    4.5 │
└─────────────────────┴────────┘
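To see the shift at work, take 2025-01-07 (a Tuesday in our data): plain DATE_TRUNC maps it to Monday 2025-01-06, while the shifted version maps it to Wednesday 2025-01-01. A small sketch you can run to check this yourself:

import duckdb

# compare the plain and shifted week truncations for a single Tuesday
duckdb.sql(
    """
    SELECT
        DATE_TRUNC('week', TIMESTAMP '2025-01-07') AS monday_start,
        DATE_TRUNC('week', TIMESTAMP '2025-01-07' - INTERVAL 2 DAYS) + INTERVAL 2 DAYS AS wednesday_start
    """
)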
NOTE: In general, we recommend only using ORDER BY as late as possible in your queries, and until that point, not making any assumptions about the physical ordering of your data. You'll see in the next section how to get around physical ordering assumptions when performing order-dependent operations.
Rolling and tumbling statistics
If you work in finance, then rolling means are probably your bread and butter. For example, with data:
from datetime import datetime

dates = [
    datetime(2025, 1, 1),
    datetime(2025, 1, 2),
    datetime(2025, 1, 3),
    datetime(2025, 1, 4),
    datetime(2025, 1, 5),
    datetime(2025, 1, 7),
]
sales = [2.0, 4.6, 1.32, 1.11, 9, 8]
data = {"date": dates, "sales": sales}
you may want to smooth out 'sales' by taking a rolling average over the last three data points. With dataframes, it's easy:
# pandas
import pandas as pd

df_pd = pd.DataFrame(data)
df_pd["sales_smoothed"] = df_pd["sales"].rolling(3).mean()

# Polars
import polars as pl

df_pl = pl.DataFrame(data)
df_pl.with_columns(sales_smoothed=pl.col("sales").rolling_mean(3))
shape: (6, 3)
┌─────────────────────┬───────┬────────────────┐
│ date                ┆ sales ┆ sales_smoothed │
│ ---                 ┆ ---   ┆ ---            │
│ datetime[μs]        ┆ f64   ┆ f64            │
╞═════════════════════╪═══════╪════════════════╡
│ 2025-01-01 00:00:00 ┆ 2.0   ┆ null           │
│ 2025-01-02 00:00:00 ┆ 4.6   ┆ null           │
│ 2025-01-03 00:00:00 ┆ 1.32  ┆ 2.64           │
│ 2025-01-04 00:00:00 ┆ 1.11  ┆ 2.343333       │
│ 2025-01-05 00:00:00 ┆ 9.0   ┆ 3.81           │
│ 2025-01-07 00:00:00 ┆ 8.0   ┆ 6.036667       │
└─────────────────────┴───────┴────────────────┘
We're relying on our data being sorted by 'date'. In pandas / Polars, we often know that our data is ordered in a particular way, and that order is often preserved across operations, so calculating a rolling mean with ordering assumptions is fine. For SQL engines, however, row order is typically undefined, although there are some limited cases where DuckDB promises to maintain order. The solution is to specify 'ORDER BY' inside your window function - this tells the engine which column(s) to use to determine the order in which to compute the rolling mean:
import duckdb

duckdb.sql(
    """
    SELECT
        *,
        MEAN(sales) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sales_smoothed
    FROM df_pl
    """
)
┌─────────────────────┬────────┬────────────────────┐
│        date         │ sales  │   sales_smoothed   │
│      timestamp      │ double │       double       │
├─────────────────────┼────────┼────────────────────┤
│ 2025-01-01 00:00:00 │    2.0 │                2.0 │
│ 2025-01-02 00:00:00 │    4.6 │                3.3 │
│ 2025-01-03 00:00:00 │   1.32 │               2.64 │
│ 2025-01-04 00:00:00 │   1.11 │ 2.3433333333333333 │
│ 2025-01-05 00:00:00 │    9.0 │               3.81 │
│ 2025-01-07 00:00:00 │    8.0 │  6.036666666666666 │
└─────────────────────┴────────┴────────────────────┘
This gets us close to the pandas/Polars output, but it's not identical - notice how the first two rows are null in the dataframe case, but non-null in the SQL case! This is because the dataframe solution only computes the mean when there are at least window_size (in this case, 3) observations per window, whereas the DuckDB output computes the mean for every window. We can remedy this by using a case statement (and also a named window for readability):
import duckdb

duckdb.sql(
    """
    SELECT
        *,
        CASE
            WHEN (COUNT(sales) OVER w) >= 3 THEN MEAN(sales) OVER w
            ELSE NULL
        END AS sales_smoothed
    FROM df_pl
    WINDOW w AS (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    """
)
┌─────────────────────┬────────┬────────────────────┐
│        date         │ sales  │   sales_smoothed   │
│      timestamp      │ double │       double       │
├─────────────────────┼────────┼────────────────────┤
│ 2025-01-01 00:00:00 │    2.0 │               NULL │
│ 2025-01-02 00:00:00 │    4.6 │               NULL │
│ 2025-01-03 00:00:00 │   1.32 │               2.64 │
│ 2025-01-04 00:00:00 │   1.11 │ 2.3433333333333333 │
│ 2025-01-05 00:00:00 │    9.0 │               3.81 │
│ 2025-01-07 00:00:00 │    8.0 │  6.036666666666666 │
└─────────────────────┴────────┴────────────────────┘
Now it matches the pandas / Polars output exactly 😇!
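As an aside, our data is unevenly spaced, so you might also want the window to be defined by time rather than by a number of rows. A RANGE frame with an interval bound is one way to sketch that - here, averaging over the current row plus the preceding 2 days. This isn't something the example above needed, just an illustration of the idea:

import duckdb

duckdb.sql(
    """
    SELECT
        *,
        MEAN(sales) OVER (
            ORDER BY date
            RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW
        ) AS sales_smoothed
    FROM df_pl
    """
)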
What if you don't like SQL?
If you want to use DuckDB as an engine but prefer Python APIs, some available options are:
- SQLFrame: transpiles the PySpark API to different backends, including DuckDB.
- DuckDB's Python Relational API: very strict and robust, though documentation is quite scant. In particular, window expressions are not yet supported (but they are on the roadmap!).
- Narwhals: transpiles the Polars API to different backends. For DuckDB it uses DuckDB's Python Relational API, and so it also does not yet support window expressions.
- Ibis: transpiles its own API to different backends (see the sketch just below).
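For example, here's a minimal sketch of the last option, reproducing the centering example with Ibis and its default DuckDB backend. It assumes ibis-framework is installed, and the exact API may vary between Ibis versions:

import ibis
import pandas as pd

# build an in-memory table and let Ibis turn the reduction into a window function
t = ibis.memtable(pd.DataFrame({"a": [1, 3, -1, 8]}))
expr = t.mutate(a_centered=t.a - t.a.mean())
print(expr.to_pandas())  # executed by Ibis' default DuckDB backend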
What's more, DuckDB allows you to write queries against in-memory pandas and Polars dataframes. There's nothing wrong with mixing and matching tools - in fact, that'll probably take you further than swearing by a single tool and trying to do everything using just that.
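For example, a small sketch of such a round trip - query an in-memory Polars dataframe with DuckDB, then hand the result straight back to Polars via the relation's .pl() method (.df() would give you pandas instead):

import duckdb
import polars as pl

df_pl = pl.DataFrame({"a": [1, 3, -1, 8]})

# DuckDB picks up df_pl from the local scope; .pl() converts the result to Polars
result = duckdb.sql("SELECT a, a - MEAN(a) OVER () AS a_centered FROM df_pl").pl()
print(result)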
Conclusion
We've learned how to translate some common dataframe operations to SQL so that we can port them over to DuckDB. We looked at centering, resampling, and rolling statistics. Porting to SQL / DuckDB may be desirable if you would like to use the DuckDB engine, if your client and/or team prefer SQL to dataframe APIs, or if you would like to have a robust and mostly standardised solution which is unlikely to break in the future.
If you would like help implementing solutions with any of the tools covered in this post or would like to sponsor efforts toward dataframe API unification, we can help!