Ibis: Python data analysis productivity framework

Ibis is a library pretty useful on data analysis tasks that provides a pandas-like API that allows operations like create filter, add columns, apply math operations etc in a lazy mode so all the operations are just registered in memory but not executed and when you want to get the result of the expression you created, Ibis compiles that and makes a request to the remote server (remote storage and execution systems like Hadoop components or SQL databases). Its goal is to simplify analytical workflows and make you more productive.

Ibis was created by Wes McKinney and is mainly maintained by Phillip Cloud and Krisztián Szűcs. Also, recently, I was invited to become a maintainer of the Ibis repository!

Maybe you are thinking: "why should I use Ibis?". Well, if you have any of the following issues, probably you should consider using Ibis in your analytical workflow!

  • if you need to get data from a SQL database but you don't know much about SQL ...
  • if you create SQL statements manually using string and have a lot of IF's in your code that compose specific parts of your SQL code (it could be pretty hard to maintain and it will makes your code pretty ugly) ...
  • if you need to handle data with a big volume ...

uarray update: API changes, overhead and comparison to __array_function__

uarray is a generic override framework for objects and methods in Python. Since my last uarray blogpost, there have been plenty of developments, changes to the API and improvements to the overhead of the protocol. Let’s begin with a walk-through of the current feature set and API, and then move on to current developments and how it compares to __array_function__. For further details on the API and latest developments, please see the API page for uarray. The examples there are doctested, so they will always be current.


Other array objects

NumPy is a simple, rectangular, dense, and in-memory data store. This is great for some applications but isn't complete on its own. It doesn't encompass every single use-case. The following are examples of array objects available today that have different features and cater to a different kind of audience.

  • Dask is one of the most popular ones. It allows distributed and chunked computation.
  • CuPy is another popular one, and allows GPU computation.
  • PyData/Sparse is slowly gaining popularity, and is a sparse, in-memory data store.
  • XArray includes named dimensions.
  • Xnd is another effort to re-write and modernise the NumPy API, and includes support for GPU arrays and ragged arrays.
  • Another effort (although with no Python wrapper, only data marshalling) is xtensor.

Some of these objects can be composed. Namely, Dask both expects and exports the NumPy API, whereas XArray expects the NumPy API. This makes interesting combinations possible, such as distributed sparse or GPU arrays, or even labelled distributed sparse or CPU/GPU arrays.

Also, there are many other libraries (a popular one being scikit-learn) that need a back-end mechanism in order to be able to support different kinds of array objects. Finally, there is a desire to see SciPy itself gain support for other array objects.

__array_function__ and its limitations

One of my motivations for working on uarray were the limitations of the __array_function__ protocol, defined in this proposal. The limitations are threefold:

  • It can only dispatch on array objects.
  • Consequently, it can only dispatch on functions that accept array objects.
  • It has no mechanism for conversion and coercion.
  • Since it conflates arrays and backends, only a single backend type per array object is possible.

These limitations have been partially discussed before.

uarray — The solution?

With that out of the way, let's explore uarray, a library that hopes to resolve these issues, and even though the original motivation was NumPy and array computing, the library itself is meant to be a generic multiple-dispatch mechanism.

In [1]:
# Enable __array_function__ for NumPy < 1.17.0
In [2]:
import uarray as ua
import numpy as np

In uarray, the fundamental building block is a multimethod. Multimethods have a number of nice properties, such as automatic dispatch based on backends. It is important to note here that multimethods will be written by API authors, rather than implementors. Here's how we define a multimethod in uarray:

Read more…

Labs update and May highlights

Time flies when you're having fun. Here is an update of some of the highlights of my second month at Quansight Labs.

The making of a black hole image & GitHub Sponsors

Both Travis and myself were invited by GitHub to attend GitHub Satellite in Berlin. The main reason was that Nat Friedman (GitHub CEO) decided to spend the first 20 minutes of his keynote to highlight the Event Horizon Telescope's black hole image and the open source software that made that imaging possible. This included the scientific Python very prominently - NumPy, Matplotlib, Python, Cython, SciPy, AstroPy and other projects were highlighted. At the same time, Nat introduced new GitHub features like "used by", a triaging role and new dependency graph features and illustrated how those worked for NumPy. These features will be very welcome news to maintainers of almost any project.

GitHub Satellite'19 keynote, showcasing NumPy and Matplotlib

The single most visible feature introduced was GitHub Sponsors:

GitHub Sponsors enabled on the NumPy repo

I really enjoyed meeting Devon Zuegel, Product Manager of the Open Source Economy Team at GitHub, in person after previously having had the chance to exchange ideas with her about the funding related needs of scientific Python projects and their core teams. I'm confident that GitHub Sponsors will evolve in a direction that's beneficial to community-driven open source projects.

Read more…