Cat looking out of a window, with dataframe logos outside. Original image by Lucy Jackline: https://unsplash.com/photos/a-cat-sitting-on-a-window-sill-looking-out-a-window-O896LIqr2vc

Universal dataframe support with the Arrow PyCapsule Interface + Narwhals

Published January 8, 2025

Marco Gorelli


If you were writing a data science tool in 2015, you'd probably have added pandas support and called it a day. However, we've fast-forwarded to 2025, and if you do that today you'll be inundated with endless "can you support Polars / DuckDB / PyArrow / DataFusion / ..." requests. Yet you have no interest in learning the subtle differences between these libraries, preferring instead to focus on the problems your own library set out to solve. What can you do?

Today, you'll learn how to create tools which support all kinds of dataframes:

  • The Arrow PyCapsule Interface, if you need access to dataframe data from a low-level language like C or Rust.
  • Narwhals, if you want to keep your logic in Python.
  • Narwhals and the PyCapsule Interface together, if you want it all!

Summing non-null values in a column - "slow down, Professor!"

We'll learn how to write a dataframe-agnostic function, agnostic_sum_i64_column. Here are the requirements:

  • Given a dataframe df and a column name column_name, compute the sum of all non-null values in that column.
  • If that column is not present, or if it is not of type Int64, raise an error.
  • pandas.DataFrame, polars.DataFrame, duckdb.DuckDBPyRelation, and pyarrow.Table should all be supported, without any of them being required.

Summing non-null values isn't rocket science, but how can we do it in a way that supports all of these libraries without requiring any of them?

Low-level solution: PyCapsule Interface in Rust via PyO3

An example of a Rust solution via PyO3 can be found at pycapsule-demo/src/lib.rs.

The technical details are beyond the scope of this post, but the summary is:

  • We accept any object which implements the ArrowStreamExportable protocol from the PyCapsule Interface. We can check for this by looking for the __arrow_c_stream__ attribute:


    >>> import polars as pl
    >>> import duckdb
    >>> import pandas as pd
    >>> import pyarrow as pa
    >>> hasattr(pl.DataFrame, '__arrow_c_stream__')
    True
    >>> hasattr(duckdb.DuckDBPyRelation, '__arrow_c_stream__')
    True
    >>> hasattr(pd.DataFrame, '__arrow_c_stream__')
    True
    >>> hasattr(pa.Table, '__arrow_c_stream__')
    True

  • The solution is totally agnostic to the exact input, as it respects a standardised interface. We can verify this by trying to pass different dataframe libraries to the function:


    >>> from pycapsule_demo import agnostic_sum_i64_column
    >>> df_pl = pl.DataFrame({'a': [1,1,2], 'b': [4,5,6]})
    >>> df_pd = pd.DataFrame({'a': [1,1,2], 'b': [4,5,6]})
    >>> df_pa = pa.table({'a': [1,1,2], 'b': [4,5,6]})
    >>> rel = duckdb.sql("select * from df_pd")
    >>> agnostic_sum_i64_column(df_pl, column_name="a")
    6
    >>> agnostic_sum_i64_column(df_pd, column_name="a")
    6
    >>> agnostic_sum_i64_column(df_pa, column_name="a")
    6
    >>> agnostic_sum_i64_column(rel, column_name="a")
    6

Like magic, our function works agnostically for any of these dataframes, without us having to write any specialised code to handle the subtle differences between them!

NOTE: If you try running the above multiple times, you may note that for DuckDB, agnostic_sum_i64_column(rel, column_name="a") can only be called once for a given rel object - a second call would raise. Discussion about whether this is a feature or bug is ongoing at https://github.com/duckdb/duckdb/discussions/15536.

If you found the above example a little daunting, you may be wondering if a simpler solution exists which you can develop entirely in Python-land. Enter: Narwhals.

Python solution: Narwhals

Narwhals is a lightweight, dependency-free, and extensible compatibility layer between dataframe libraries. To use it to write a dataframe-agnostic function, you need to follow three steps:

  • Call narwhals.from_native on the user's object.
  • Use the Narwhals API.
  • If you want to return to the user an object of the same kind that they started with, call to_native.

In this case, a Narwhals version of agnostic_sum_i64_column looks like this:


import narwhals as nw
from narwhals.typing import IntoFrame


def agnostic_sum_i64_column_narwhals(df_native: IntoFrame, column_name: str) -> int:
    lf = nw.from_native(df_native).lazy()
    schema = lf.collect_schema()
    if column_name not in schema:
        msg = f"Column '{column_name}' not found, available columns are: {schema.names()}."
        raise ValueError(msg)
    if (dtype := schema[column_name]) != nw.Int64:
        msg = f"Column '{column_name}' is of type {dtype}, expected Int64."
        raise TypeError(msg)
    df = lf.collect()
    return df[column_name].sum()

Just like before, we can pass in different inputs and get the same result:


>>> agnostic_sum_i64_column_narwhals(df_pl, column_name="a")
6
>>> agnostic_sum_i64_column_narwhals(df_pd, column_name="a")
6
>>> agnostic_sum_i64_column_narwhals(df_pa, column_name="a")
6
>>> agnostic_sum_i64_column_narwhals(rel, column_name="a")
6

So long as you stick to the Narwhals API, your code will keep working with all major dataframe libraries, as well as with new ones which may appear in the future!

Narwhals vs PyCapsule Interface: when to use one over the other, and when to use them together

If the Narwhals API is extensive enough for your use-case, then this is arguably a simpler and easier solution than writing your own Rust function. On the other hand, if you write a Rust function using the PyCapsule Interface, then you have complete control over the data. So, when should you use which?

Let's cover some scenarios:

  • If you want your dataframe logic to stay completely native (e.g. Polars in -> Polars out): use Narwhals.
  • If you want to keep your library logic pure-Python and free of heavy dependencies, so it's easy to maintain and install: use Narwhals. Packaging a pure-Python project is much easier than packaging one which includes Rust or C.
  • If you want complete control over your data: use the PyCapsule Interface. If you have the necessary Rust / C skills, there's no limit to how complex and bespoke your data processing can be.
  • If you want to do part of your processing in Python, and part of it in Rust - use both! An example of a library which does this is Vegafusion. This is facilitated by the fact that Narwhals exposes the PyCapsule Interface for both import and export.

What about Polars Plugins?

We wrote some Rust code earlier to express custom logic. You may have come across another way to do this in Polars: Expression Plugins. How do they differ from writing custom code with the PyCapsule Interface?

If you know that Polars is the only library you need to support, then Polars Plugins are the preferred way to write custom user-defined logic. If you'd like to learn how, we can help!

Conclusion

As the dataframe ecosystem grows, so does the demand for tools with universal dataframe support (as opposed to those tightly coupled to pandas). We've learned how to use the PyCapsule Interface to write dataframe-agnostic code from a low-level language such as Rust or C. We also learned how to do the same entirely from Python using Narwhals. Finally, we discussed when it may make sense to use one, the other, or even both approaches together.

If you would like to contribute to efforts towards standardising the data ecosystem and enabling innovation, or would just like to speak with data experts about how we can help your organisation, please book a call with us, we'd love to hear from you!
