Querying multiple backends with Ibis
Published November 13, 2020
In our recent Ibis post, we discussed querying & retrieving data using a familiar Pandas-like interface. That discussion focused on the fluent API that Ibis provides to query structure from a SQLite database—in particular, using a single specific backend. In this post, we'll explore Ibis's ability to answer questions about data using two different Ibis backends.
Ibis in the scientific Python ecosystem
Before we delve into the technical details of using Ibis, we'll consider Ibis in the greater historical context of the scientific Python ecosystem. It was started by Wes McKinney, the creator of Pandas, as way to query information on the Hadoop distributed file system and PySpark. More backends were added later as Ibis became a general tool for data queries.
Throughout the rest of this post, we'll highlight the ability of Ibis to generically prescribe query expressions across different data storage systems.
The design of Ibis backends
Currently, Ibis supports more than ten different backends:
A backend manages the computation of an Ibis query expression; the query expression itself is independent of the computation. A backend implementation that can be queried with Ibis has one of the three following architectures:
- Direct execution backends (e.g., Pandas and HDF5);
- Expression-generating backends that create SQLAlchemy expressions (e.g.,
- String-generating backends (e.g.,
We'll unravel some of the different capabilities of each approach below.
A data-driven history of Ibis compatibility
The table below looks at over 2000 issues in the Ibis project. It provides an annual summary of the issues tagged in Ibis for different backends over a span of six years.
We note an early focus SQLite, SQLAlchemy and Impala.
Later, work began on the Pandas backend rounding out the three different types of backgrounds.
From this point, improvements were made to these key backends as ClickHouse, Spark and PostgreSQL.
For the past two years, Quansight, in partnership with OmniSci, added the
string-generating backend. Further, our responsibilities have expanded
to support Ibis as community maintainers through Quansight Labs.
This collaboration introduced geospatial functionality to Ibis for several backends.
There are currently ongoing efforts to introduce support for SQL Server backends. What other backends would you like to see for Ibis? Maybe our community could benefit from a Dask backend or an Altair backend?
Ibis direct execution backends
Ibis direct execution backends like Pandas and HDF5 operate on conventional in-memory python objects. Pandas is the gold standard for structured data in Python and inspires the API for Ibis.
pd is an Ibis backend based on
expression is an Ibis query that has
We'll recognize the
execute method when we return
pandas.DataFrames from Ibis expression.
compile method does not trigger any computation, rather it constructs an intermediate form
that is interpreted by a backend.
In the case of direction execution backends, the
expression compiles to an the original Ibis
expression. The computation itself is carried out based on a set of recipes defined in Ibis.
In general, we would typically do this work directly in Pandas; however this approach is
practical in making tests for backend-independent expressions.
Ibis expression-generating backends.
Expression-generating backends operate on SQL databases that interoperate with SQLAlchemy.
In the case of expression-generating backends, the intermediate representation is a
SQLAlchemy is the Database Toolkit for Python and Ibis leverages its compatibility
with traditional SQL databases.
Ibis string generating backends.
There are two options for downloading Ibis: using
pip or using
pip: the extra backends from Ibis need to be requested explicitly, e.g.:
conda: all the supported backends (e.g., SQL, Pandas, BigQuery, Omnisci, etc.) are bundled in a single conda package and can be downloaded/installed simultaneously:
String-generating backends allow Ibis to interface with big data systems that manage their own computation. For example, we may connect to an example Omnisci database.
omnisci object is described as a string-generating backend because the intermediate representation of the query is a flavor of SQL.
A string-generating expression compiles to
ibis.omniscidb flavored SQL, while
ibis.bigquery may have a different string representation.
Major credit goes to @xmnlab in his heroic PR to introduce
ibis.omniscidb into Ibis. You can watch
the drama play out in this Github Issue. We'd like to thank the maintainers of Ibis for
their effort in supporting the Ibis community.
Learn more about OmniSci and
ibis.omniscidb in this Quansight Labs post:
Ibis: Python data analysis productivity framework.