Making GPUs accessible to the PyData Ecosystem via the Array API Standard.
Published March 31, 2022
GPUs have become an essential part of the scientific computing stack, and with advances in GPU technology and the ease of accessing a GPU in the cloud or on-prem, it is in the best interest of the PyData community to spend time and effort making GPUs accessible to users of PyData libraries. A typical user in the PyData ecosystem is quite familiar with the APIs of libraries like SciPy, scikit-learn, and scikit-image, and at the moment these libraries are largely limited to single-threaded operations on CPU (there are exceptions, like linear algebra functions and scikit-learn functionality that uses OpenMP under the hood). In this blog post I will talk about how we can use the Python Array API Standard with the fundamental libraries in the PyData ecosystem, along with CuPy, to make GPUs accessible to the users of these libraries. With the introduction of that standard by the Consortium for Python Data API Standards, and its adoption mechanism in NEP 47, it is now possible to write code that is portable between NumPy and other array/tensor libraries that adopt the Array API Standard. We will also discuss the workflow and challenges involved in actually achieving such portability.
Motivation and goal
The motivation for this exercise comes from the fact that having GPU-specific code in SciPy and the scikits would be too specialized and very hard to maintain. The goal of this demo is (a) to show things working on GPU in a real-world code example using SciPy, scikit-learn and scikit-image together, and (b) to demonstrate that there is a significant performance benefit. We'll be using CuPy as the GPU library for this demo. Since CuPy supports NVIDIA GPUs as well as AMD's ROCm GPU platform, the demo will work on either of those.
The ultimate goal is to have SciPy, scikit-learn and scikit-image accept any type of array, so they can work with CuPy, PyTorch, JAX, Dask and other libraries. This work is part of a larger project sponsored by AMD for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond.
A primer on the Array API Standard
As of today, NumPy, with over 100 million monthly downloads (combined from conda and PyPI), is the fundamental library of choice for array computing. In the past decade a plethora of array libraries have evolved, most of which more or less derive from NumPy's API, an API that is itself not very well defined. NumPy is a CPU-only library, while these newer libraries are built for a variety of more exotic hardware. As a consequence, the APIs of these libraries have diverged a lot, so much so that it's quite difficult to write code that works with multiple (or all) of them. The Array API standard aims to address this issue by specifying an API for the most common ways arrays are constructed and used. At the time of writing, NumPy (>=1.22.0) and CuPy (>=10.0.0) have adopted the Array API Standard.
Note about using the Array API Standard
The definition of the Array API Standard and its usage may evolve over time. This blog post is only meant to illustrate how we can use it to make GPUs accessible to users of libraries like SciPy and scikits and the performance benefits of that. It does not necessarily define any best practices around adopting the standard. Such adoption best practices are still a work in progress and will evolve over the next year.
The example: image segmentation through spectral clustering
The example we have chosen to demonstrate adopting the Array API standard to make GPUs accessible in SciPy and the scikits is taken from scikit-learn's documentation: segmenting a picture of Greek coins into regions. This example uses spectral clustering on a graph created from voxel-to-voxel differences in an image to break the image into multiple partly homogeneous regions (or clusters). In essence, spectral clustering is about finding clusters in a graph, i.e., finding the most connected subgraphs and thereby identifying clusters. Alternatively, it can be described as finding subgraphs with a maximum number of within-cluster connections and a minimum number of between-cluster connections.
Greek coins from Pompeii ("British Museum, London, England").
The dataset we are using for this demonstration is of Greek coins from Pompeii (available via the skimage.data.coins module). It is an image of several coins outlined against a gray background.
We convert the data of the coins into a graph object and then apply spectral clustering to it for segmenting coins.
The process for making the code run on GPU
get_namespace - getting the array API module
NEP 47, "Adopting the array API standard", outlines a basic workflow for adopting the array API standard in array provider and consumer libraries. We will now describe in detail how we went through the process of implementing this in forks of SciPy, scikit-learn and scikit-image.
The Array API defines a method named __array_namespace__, which returns the array API module with all the array API functions accessible from it. It has a couple of key benefits:
- array consumer libraries can check whether an incoming array supports the standard.
- all the functions that work with the array object become accessible without needing to explicitly import the array library's Python module.
NEP 47 recommends defining a utility function named get_namespace to check for such an attribute. An example of a potential implementation of the get_namespace function:
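The original implementation isn't reproduced in this extract, so here is a minimal sketch in the spirit of NEP 47 (real implementations handle scalars and mixed inputs more carefully):

```python
import numpy as np


def get_namespace(*xs):
    # Collect the array API namespaces advertised by the inputs.
    namespaces = {
        x.__array_namespace__() for x in xs if hasattr(x, "__array_namespace__")
    }
    if len(namespaces) > 1:
        raise ValueError("inputs belong to multiple array namespaces")
    if namespaces:
        (xp,) = namespaces
        return xp
    # Fall back to NumPy for inputs that don't support the standard yet.
    return np
```

Consumer-library code can then call `xp = get_namespace(arr)` once at the top of a function and use `xp.*` for everything that follows.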
Workflow for using get_namespace
Now let's walk through an example to see how this works in practice. Consider the following function from SciPy:
This function is taken from scipy/sparse/_sputils.py. We can see that it explicitly calls np.array on the given array object, regardless of the type of obj. This implementation is obviously not generic enough to support any array library other than NumPy. We'll now apply the Array API's get_namespace magic to make it generic, by adding a couple of lines to the code above:
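The SciPy snippet itself isn't reproduced in this extract, so the sketch below uses a simplified stand-in for the _sputils helper (the name getdata and its body are illustrative), showing the NumPy-only version next to the generic one:

```python
import numpy as np


def get_namespace(*xs):
    # Simplified NEP 47-style helper: return the arrays' shared Array API
    # namespace, falling back to NumPy.
    namespaces = {
        x.__array_namespace__() for x in xs if hasattr(x, "__array_namespace__")
    }
    return namespaces.pop() if len(namespaces) == 1 else np


# Original, NumPy-only flavour: always coerces to np.ndarray.
def getdata_numpy(obj, dtype=None):
    return np.array(obj, dtype=dtype)


# Generic flavour: a couple of extra lines make it work with any
# standard-compliant array library.
def getdata(obj, dtype=None):
    xp = get_namespace(obj)
    return xp.asarray(obj, dtype=dtype)
```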
Now you can see that get_namespace finds the Array API standard-compatible namespace for the given object and calls .asarray on it instead of np.array, making the function generic for any array provider library that supports the standard.
Workflow for using the Array API Standard for writing code that works with multiple libraries
Applying it to our demo example
Now let's take a look at the code for our demo example (segmenting the picture of Greek coins into regions) to understand what it takes to make it run on GPU(s).
Let's walk through the code above to get a sense of what we're trying to achieve here.
- We have a dataset of Greek coins from Pompeii (imported from skimage.data.coins)
- We rescale the image to 20% of its original size to speed up processing
- We convert the image to a graph data structure
- We apply spectral clustering to the graph (via sklearn.cluster.spectral_clustering)
- We plot the resulting clustering
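On CPU, the steps above look roughly like the scikit-learn coins example (the smoothing, beta and cluster-count values here follow that example and are illustrative; plotting is omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.data import coins
from skimage.transform import rescale
from sklearn.cluster import spectral_clustering
from sklearn.feature_extraction import image

orig_coins = coins()

# Smooth, then rescale to 20% to speed up processing.
smoothed = gaussian_filter(orig_coins, sigma=2)
rescaled = rescale(smoothed, 0.2, mode="reflect", anti_aliasing=False)

# Convert the image to a graph whose edge weights are gradient values.
graph = image.img_to_graph(rescaled)
beta = 10
graph.data = np.exp(-beta * graph.data / graph.data.std()) + 1e-6

# Cluster the graph into regions; one label per pixel.
labels = spectral_clustering(graph, n_clusters=25, random_state=42)
labels = labels.reshape(rescaled.shape)
```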
Now to make this run on GPU we need to make minor changes to the demo code (as well as to each of the libraries involved; see the next section). Since the input must be a GPU array and NumPy is a CPU-only library, we'll use CuPy, a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. In addition, we made the rescaling factor a variable so we could benchmark performance as a function of input data size.
Changes to SciPy, scikit-learn, scikit-image, and CuPy
Since SciPy, scikit-learn and scikit-image are designed mainly for NumPy, we need to use the Array API standard to dispatch to CuPy or NumPy based on the input type. We do that by changing all the functions that are called in the above demo to use get_namespace, and by replacing NumPy-specific calls with their namespace equivalents so that they properly dispatch to the appropriate array library. This is straightforward in some places, and not so trivial in others. We'll discuss the more challenging cases in the next section.
The final code for the demo can be seen here.
The changes to SciPy, scikit-learn, scikit-image and CuPy can be seen here:
The notebook for the demo can be seen here.
Challenges and workarounds
At the moment the Array API standard specifies many of the most commonly used functions in the NumPy API. This makes it easier to replace NumPy functions with equivalent ones from the standard. There are still some rough edges though, which make the process a bit harder. These are the three main issues we faced in the process:
1. Compiled code
There is a lot of functionality in SciPy and the scikits that is not pure Python: performance-critical parts are written as compiled code extensions. These parts depend on NumPy's internal memory representation, which makes them CPU-only. This prevents using the Array API standard to execute those functions with CuPy. There is another project, uarray, for dispatching to compiled code, which needs to be used alongside the Array API standard to make this fully work. See for instance the following piece of code:
sparse.coo_matrix is a C++ extension in the scipy.sparse module. We need to use cupyx.scipy.sparse.coo_matrix for CuPy arrays and scipy.sparse.coo_matrix for NumPy arrays. This is not possible via get_namespace. For this demo we have added a workaround until support for backends is implemented in SciPy. The workaround looks something like this for CuPy:
Note that this workaround is only valid for NumPy/CuPy arrays.
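The actual workaround isn't reproduced in this extract; in essence it is a hard-coded module check along these lines (the helper name is illustrative):

```python
import numpy as np
import scipy.sparse


def get_coo_matrix(arr):
    # NumPy/CuPy-only workaround: scipy.sparse wraps compiled code, so
    # get_namespace cannot dispatch it. Pick the implementation based on
    # the module the array type comes from instead.
    if type(arr).__module__.split(".")[0] == "cupy":
        import cupyx.scipy.sparse

        return cupyx.scipy.sparse.coo_matrix
    return scipy.sparse.coo_matrix


# On a NumPy array this resolves to scipy.sparse.coo_matrix:
coo_matrix = get_coo_matrix(np.ones(3))
mat = coo_matrix((np.array([1.0]), (np.array([0]), np.array([1]))), shape=(2, 2))
```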
2. Not implemented functions
We encountered a number of NumPy functions used by our Greek coins demo which didn't have a direct equivalent in the Array API standard. This was mostly due to the fact that those functions are not generic enough to be defined in the Array API. Some examples of unimplemented functions that we encountered during this exercise:
For some of these the solution was to reimplement an equivalent function; for the remaining ones the workaround was to use the library-specific NumPy/CuPy APIs and then convert back to an Array API standard-compliant array.
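As an illustration of the second workaround (using np.gradient as an example of a function with no counterpart in the standard; the helper name is illustrative):

```python
import numpy as np


def gradient_workaround(x):
    # `gradient` has no equivalent in the Array API standard, so we call
    # the library-specific implementation (np.gradient / cupy.gradient)
    # and wrap the result back into the array's own namespace.
    if type(x).__module__.split(".")[0] == "cupy":
        import cupy as xp
    else:
        xp = np
    return xp.asarray(xp.gradient(xp.asarray(x)))


g = gradient_workaround(np.asarray([1.0, 2.0, 4.0]))
```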
3. Inconsistencies between the NumPy API and the Array API standard
Although the standard has managed to get good coverage of the NumPy API, there are places where the two are inconsistent. This may be function behaviour that is missing or different, or function or parameter names that don't match (e.g., concat in the standard instead of NumPy's concatenate).
Here is an example of a naming inconsistency between NumPy and the Array API standard:
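The original snippet isn't reproduced in this extract; the best-known rename is concatenate vs. concat, sketched here (recent NumPy versions also expose concat directly):

```python
import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 3))

# NumPy's historical spelling:
out = np.concatenate([a, b], axis=0)

# The Array API standard names the same operation `concat`; with a
# standard-compliant namespace `xp` this reads xp.concat([a, b], axis=0).
concat = getattr(np, "concat", np.concatenate)  # fall back on older NumPy
out_std = concat([a, b], axis=0)
```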
And here is an example of a behavioral inconsistency for an indexing operation:
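That snippet isn't reproduced here either; one representative case is integer-array ("fancy") indexing, which NumPy supports but which the standard, at the time of writing, did not require:

```python
import numpy as np

x = np.arange(10)
idx = np.asarray([0, 3, 7])

# NumPy supports integer-array indexing out of the box:
fancy = x[idx]

# The Array API standard did not require this form of indexing, so
# portable code expresses it differently, e.g. via take (which a later
# revision of the standard added):
portable = np.take(x, idx, axis=0)
```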
Each of the inconsistencies can be worked around. Function renames may require some detective work by referring to the Array API standard documentation, but are typically easy code changes. Differences in behavior can be harder to deal with at times. The Array API standard aims to provide a set of functionality that's orthogonal but complete - so if it's possible to perform some array operation on accelerator devices, then there's typically a way to access it in the standard. In a few cases, integer indexing being one, work is still ongoing to extend the standard and we may open a feature request or contribute to an existing discussion while adding a workaround to our own code.
After dealing with all the challenges above, our demo works with CuPy arrays as input. In this section, we will talk about its performance on GPU vs. CPU.
We ran this on AMD GPUs as well as NVIDIA GPUs. Below we show plots comparing the runtime of the demo code on CPU and GPU as a function of input array size. The x-axis of the plot is the image dimension, i.e. the dimensions of the original image after rescaling it by a constant factor along both axes. For this demo the rescaling factors were 0.1, 0.2, 0.4, 0.6, 0.8 and 1.
On AMD GPU:
This was run on a server with an AMD Instinct MI50 GPU with 32 GB of memory, and a CPU with 64 physical cores.
On NVIDIA GPU:
This was run on an NVIDIA TITAN RTX GPU with 24 GB of memory, and a CPU with 24 physical cores.
The GPU vs CPU plots show what you would expect: the computation is faster with CuPy (i.e. on GPU) compared to NumPy (i.e. on CPU) for larger image sizes. The computation on GPU is slower than on CPU for image sizes less than 61 x 77. This is due to the overhead of moving data to the device (GPU), which is significant compared to the time taken for actual computation. The larger the image, the longer the computation takes, and hence the less important the data transfer becomes and the larger the performance improvement from using a GPU.
Reproducing these results
The code and instructions to run the analysis above can be found in this repository: Quansight-Labs/array-api-gpu-demo.
The above analysis required changes in the following libraries:
The changes are present in the array-api-gpu-demo branch of @aktech's fork of each of the above projects.
In the above-mentioned repo, you can find the code with which we created Docker images for both the NVIDIA and AMD platforms. The README also contains instructions for obtaining the pre-built Docker images from the Docker registry. Once you are inside the Docker container, running the demo is as simple as running:
Running the ROCm container on a machine with an AMD GPU is done like this:
At the moment we have only modified the functions we needed to be able to run our demo on GPU. There is of course a lot more functionality that can eventually be ported to support the Array API standard. The eventual goal of that is to make array-consuming libraries like SciPy, scikit-learn, and scikit-image fully compatible with multiple array libraries. A lot of work has been done in that direction already - but that's for a future post to talk about!
This demo work was part of a larger project to make PyData libraries work with multiple array libraries and thereby run on GPUs. This is a significant undertaking, and this work would not have been possible without funding from AMD for a team at Quansight Labs.