Refining NumPy's Python API for its 2.0 release
Published November 8, 2023
mtsokol
Mateusz Sokół
To ensure the vitality of well-established libraries, periodic cleanups play an important role in maintenance efforts. This is also the case for NumPy, which plays a central role in the Scientific Python ecosystem. In this blog post, I describe the purpose and key achievements of the NEP 52 workstream, which aimed to clean up NumPy's Python API.
Hi! I'm Mateusz Sokół and for the last three months, I've had the pleasure of participating in the Quansight Labs internship program and contributing in multiple ways to the Scientific Python ecosystem. My mentors were Ralf Gommers and Nathan Goldbaum, who supervised my work and assisted with planning my tasks.
During the program, I mainly focused on one of the NumPy Enhancement Proposals (NEPs), specifically NEP 52. This proposal, titled "Python API cleanup for NumPy 2.0", looked unremarkable at first glance, but once explored it presented a multitude of interesting technical challenges and required making impactful design decisions. It was a solid introduction to the internals of NumPy and closely related scientific libraries. This NEP also allowed me to collaborate with many engineering teams from other libraries, such as JAX, pybind11, and pandas.
In this short article, I would like to briefly outline the motivation behind NEP 52 and its scope. I will then take this opportunity to explain some of the technical issues we solved, covering both Python and C code.
As of this moment, NEP 52 has been accepted and most of the goals have been achieved, but the work is still ongoing!
NumPy, briefly
NumPy is a fundamental package for scientific computing used by scientists and engineers around the world. It emerged from numerical libraries, namely Numeric and Numarray, in the early 2000s. It allows the manipulation of high-dimensional arrays implemented in C while staying in a comfortable Python realm. Python enables rapid prototyping without needless hurdles, while providing an enormous ecosystem for building reliable and production-level software. These unique traits resulted in worldwide adoption of NumPy in many scientific and engineering domains.
Additionally, what makes NumPy stand out from commercial numerical software products, such as MATLAB, is the fact that NumPy is fully open-source software distributed under the BSD 3-Clause license.
NEP 52 - motivation & scope
Over the course of two decades NumPy's Python API has continually evolved and adapted to the evolution of the Python language. As we all know, over time, software often becomes obsolete and public APIs grow larger as new features arrive.
NEP 52 was meant to identify obsolete, duplicated, and deprecated members of the Python API and remove or rearrange them. This NEP had a few principles in mind:
- Each public function should be available from only one place.
- Redundant or misleading aliases for dtypes and functions should be removed.
- There should be a clear distinction between what is private and what is public (the concept of a "semi-private" API member should be avoided).
- Concretely define the NumPy API and remove internal usages of `import *`.
The desired result was a well-defined, unambiguous public API that is easy to learn and search, with ideally only one way to do a specific thing.
The changes we merged vary in how disruptive they are, from alias removals, which can be fixed by a script, to substantial changes, such as a package name change. The top-level list of changes is:
- Clarified NumPy's submodule structure and made all submodules accessible through lazy imports,
- Cleaned up the `numpy.lib` namespace and established well-defined submodules for it,
- Settled on canonical data type names and documented those,
- Removed redundant aliases, such as these that point to `np.inf`: `np.Infinity`, `np.Inf`, `np.INF` and `np.infty`,
- Removed niche functions and internal constants from the main namespace,
- Renamed `numpy.core` to `numpy._core` to clearly indicate that this is a private and internal submodule,
- Removed niche and misleading data type aliases, such as `int0`, `uint0`, `float_`.
To ensure a smooth migration to NumPy 2.0, the changes are communicated and addressed in several places:
- Clear and succinct release notes for each relevant PR.
- Migration guide containing all changes with migration instructions (what has changed and how it should be addressed in the codebase).
- Meaningful error messages and deprecation warnings, which can also provide migration instructions.
- Tool for automatic application of changes (originally a `sed` script was considered, but eventually a new `ruff` rule was implemented).
Selected achievements & technical challenges
In this section, I will discuss significant milestones and more notable technical challenges that we tackled along the way.
Cleaning up the main NumPy namespace included identifying "keep" and "remove" lists.
It took quite a few iterations to determine them and clear out the "tentative" list.
The main namespace members removed were mainly: internal enums (likely exposed by accident), aliases for already existing constants and functions, already deprecated items, and functions that were moved to other submodules. We managed to reduce the number of entries in `np.*` by over 80:
    >>> import numpy as np
    >>> np.__version__
    '1.26.1'
    >>> len(dir(np))
    594
And now:
    >>> import numpy as np
    >>> len(dir(np))
    511
Each removed item was replaced by an `AttributeError` that contains a migration guideline for end users:
    >>> np.byte_bounds
    Traceback (most recent call last):
    ...
    AttributeError: `np.byte_bounds` was removed in the NumPy 2.0 release. Now it's available under `np.lib.array_utils.byte_bounds`
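One way a message like this can be produced is a module-level `__getattr__` hook (PEP 562), which Python calls only when normal attribute lookup on a module fails. The sketch below is a minimal illustration of that mechanism and not NumPy's actual implementation; the `fake_np` module and the `_REMOVED` table are hypothetical names used so the example runs without NumPy installed.

```python
import types

# Hypothetical table: removed name -> migration hint shown to the user.
_REMOVED = {
    "byte_bounds": "Now it's available under `np.lib.array_utils.byte_bounds`",
}


def _module_getattr(name):
    """Called only when regular attribute lookup on the module fails."""
    if name in _REMOVED:
        raise AttributeError(
            f"`np.{name}` was removed in the NumPy 2.0 release. {_REMOVED[name]}"
        )
    raise AttributeError(f"module 'fake_np' has no attribute {name!r}")


# Stand-in module (assumption: plays the role of the `numpy` package).
fake_np = types.ModuleType("fake_np")
fake_np.__getattr__ = _module_getattr

try:
    fake_np.byte_bounds
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Keeping the hints in a single table means the error text and a migration guide can be generated from the same data.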
Guaranteeing backward compatibility
We've already touched on several types of Python API changes, such as removing aliases, adding a new member to a namespace, or moving an existing item to a new location. Each of them has different repercussions for downstream libraries that depend heavily on NumPy.
Downstream libraries that required modifications to adjust to the API changes were: SciPy, Matplotlib, pandas, JAX, scikit-learn and CuPy. There were also some one-time contributions to the pybind11, joblib, and hypothesis libraries. For each of them we had to ensure that we did not narrow down the range of supported NumPy versions. Two kinds of compatibility matter here:
Backward compatibility means that a downstream library containing code written for previous NumPy versions can be executed with a newer version. This type of compatibility is the most desirable because it allows downstream users to just "bump" the dependency version number in e.g. `pyproject.toml` and call it a day.
Forward compatibility makes it possible to write code with a new NumPy version and execute it with a previous one. This is often broken by new API entries because, by definition, a new entry isn't available in older versions.
In our efforts we paid attention to backward compatibility. Libraries such as SciPy run CI stages with both stable and nightly NumPy releases, and the latest version of SciPy has to keep supporting several older NumPy releases.
A backward incompatible change required us to, e.g., branch on the dependency version:
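    from typing import Type

    import numpy as np

    AxisError: Type[Exception]
    # numpy.exceptions is only available from NumPy 1.25.0 onwards
    if np.lib.NumpyVersion(np.__version__) >= "1.25.0":
        from numpy.exceptions import AxisError
    else:
        from numpy import AxisError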
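The snippet compares versions through `np.lib.NumpyVersion` instead of comparing the `np.__version__` string directly. A plain string comparison is lexicographic and misorders versions such as 1.9 and 1.25. The stdlib-only sketch below illustrates the pitfall; `release_tuple` is a hypothetical helper that only handles plain `X.Y.Z` strings and is not a replacement for `NumpyVersion`.

```python
def release_tuple(version):
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Lexicographic comparison: the character "9" sorts after "2",
# so 1.9.0 is (wrongly) reported as newer than 1.25.0.
naive = "1.9.0" >= "1.25.0"

# Comparing the numeric components gives the intended answer.
correct = release_tuple("1.9.0") >= release_tuple("1.25.0")

print(naive, correct)  # True False
```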
Clearing out the `numpy.lib` namespace
One of the goals of NEP 52 was to ensure that each function or constant is available from only one location, if possible. This especially concerns `numpy.lib`, whose contents were almost fully exported to the main namespace. We analyzed the module and split its members into:
- Main namespace exports - members exported to the `numpy` namespace.
- Local exports - members available only from `numpy.lib` or `numpy.lib.<submodule>`.
Members exported to the main namespace ended up in private files, such as `numpy.lib._array_utils_impl`, whereas ones exported locally received dedicated submodules, e.g. `numpy.lib.array_utils`.
As a result, the number of members in `numpy.lib` dropped from:
    >>> import numpy as np
    >>> np.__version__
    '1.26.1'
    >>> len([s for s in dir(np.lib) if not s.startswith('_')])
    192
To:
    >>> import numpy as np
    >>> len([s for s in dir(np.lib) if not s.startswith('_')])
    13
Now `numpy.lib` hosts only a handful of functions and submodules with well-defined purposes.
Renaming `numpy.core` to `numpy._core`
In my opinion, the most challenging task was renaming the `numpy.core` submodule to `numpy._core`. `numpy/core` contained most of NumPy's source code, and this change affected both downstream libraries and NumPy-internal C code.
In terms of C-level code, it is worth emphasizing that not only does Python use C functions through compiled extension modules, but also C code in NumPy accesses functionality implemented in Python directly via `PyImport_XXX`.
For the source code that is compiled into extension modules, we only had to rename `core` imports to `_core`. A more complex issue appeared for header files that are included in third-party objects. We wanted to make sure that a library or executable compiled with numpy 2.0 works with numpy 1.x installed (and vice versa). For this purpose, we used two simple mechanisms:
| | Built with numpy 1.x | Built with numpy 2.0 |
|---|---|---|
| numpy 1.x installed | standard execution | falls back to the second C import |
| numpy 2.0 installed | uses `numpy.core` stub | standard execution |
Stubs - After renaming the `numpy.core` submodule to `numpy._core`, we did not completely remove the former. Instead, we replaced it with a stub module that replicates `numpy._core` by importing it, but also generates an appropriate warning when accessed. This was necessary to ensure that objects compiled with numpy 1.x will continue to work with the numpy 2.0 release.
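The stub idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that ships in NumPy: `demo_pkg_core` stands in for the old `numpy.core` path, `demo_pkg_private_core` for the renamed module, and the warning text is made up.

```python
import types
import warnings

# Stand-in for the renamed private module (plays the role of `numpy._core`).
new_core = types.ModuleType("demo_pkg_private_core")
new_core.add = lambda a, b: a + b


def _stub_getattr(name):
    """Forward the lookup to the new module and warn about the old path."""
    value = getattr(new_core, name)  # AttributeError propagates if missing
    warnings.warn(
        "demo_pkg_core is deprecated; use the public API instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return value


# Stand-in for the old import path (plays the role of `numpy.core`).
old_core = types.ModuleType("demo_pkg_core")
old_core.__getattr__ = _stub_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_core.add(2, 3)

print(result, [str(w.message) for w in caught])
```

Because the stub only forwards lookups, code that still uses the old path keeps working while the warning points it toward the supported API.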
Fallback imports - An object compiled with numpy 2.0 headers should anticipate different versions of the locally available NumPy at runtime. Therefore, a simple "import fallback" mechanism has been implemented to cover both cases:
    /* Try the NumPy 2.0 location first. */
    PyObject *numpy = PyImport_ImportModule("numpy._core._multiarray_umath");
    if (numpy == NULL && PyErr_ExceptionMatches(PyExc_ModuleNotFoundError)) {
        /* NumPy 1.x is installed: fall back to the old location. */
        PyErr_Clear();
        numpy = PyImport_ImportModule("numpy.core._multiarray_umath");
    }
    if (numpy == NULL) {
        PyErr_SetString(PyExc_ImportError, "_multiarray_umath failed to import");
        return -1;
    }
Pickle files compatibility - Pickled arrays contain a `numpy.core.multiarray._reconstruct` path, which has changed with the rename. To make old pickles loadable in NumPy 2.0, we rely on the `numpy/core` stubs to ensure that the relevant functions can be imported. To make NumPy 2.0 pickles also loadable in NumPy 1.26.x, we backported `numpy/_core` stubs to the maintenance branch. As a result, pickle files can be used without worrying about NumPy or pickle versions.
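To see why the module path matters, the stdlib-only sketch below reproduces the situation with a toy class. It is an illustration under assumptions, not NumPy code: `Array`, `_reconstruct`, and the module name `demo_old_core` are hypothetical stand-ins for the array type, its reconstruct function, and the old `numpy.core.multiarray` path.

```python
import pickle
import sys
import types


class Array:
    """Toy container standing in for an array type."""

    def __init__(self, values):
        self.values = list(values)

    def __reduce__(self):
        # Pickle records the reconstruct function by module path and name.
        return (_reconstruct, (self.values,))


def _reconstruct(values):
    return Array(values)


# 1. "Old" layout: the function is importable as demo_old_core._reconstruct.
old_module = types.ModuleType("demo_old_core")
old_module._reconstruct = _reconstruct
_reconstruct.__module__ = "demo_old_core"
sys.modules["demo_old_core"] = old_module
payload = pickle.dumps(Array([1, 2, 3]))
path_in_pickle = b"demo_old_core" in payload

# 2. After the rename the old path is gone, so the old pickle cannot load.
del sys.modules["demo_old_core"]
try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    load_failed = True

# 3. A stub at the old path that re-exports the function makes it load again.
stub = types.ModuleType("demo_old_core")
stub._reconstruct = _reconstruct
sys.modules["demo_old_core"] = stub
restored = pickle.loads(payload)

print(path_in_pickle, load_failed, restored.values)
```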
Other libraries, such as pandas, have also followed our lead and started working on renaming their `core` submodule to `_core` (pandas PR).
Data type aliases analysis
NEP 52 was also an opportunity to revisit the dtype aliases that are available in the main namespace `np.*` (1) and through `np.dtype(...)` (2). The debate on the final form is still ongoing (specifically for `np.int_` and `np.uint`), but there are several groups of names that are available through (1) or (2):
- words: `float`, `cdouble`, `uint`, `bool`, `object`, `long`, ...
- words+bits: `int16`, `uint8`, `float64`, ...
- symbols: `p`, `L`, `i`, ...
- symbols+bytes: `c8`, `i1`, `i2`, ...
- abstract types: `numeric`, `inexact`, `integer`, ...
The canonical names, which are available in the main namespace and should be used as a first choice, are "words+bits". These names are unambiguous about which type we're referring to, and they declare the precision explicitly, leaving no room for platform-specific behavior. "Words" that refer to C types should be used when interacting with C code outside of NumPy.
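The platform dependence of C type names can be checked without NumPy. The sketch below uses the standard `struct` module to query the native size of C `long` and C `long long` on the machine it runs on; it only illustrates why a name tied to a C type does not pin down the precision, and it does not inspect any NumPy dtype.

```python
import struct

# Native size in bytes of C `long` ("l") and C `long long` ("q") on this platform.
c_long_bytes = struct.calcsize("l")
c_long_long_bytes = struct.calcsize("q")

# C `long` is typically 4 bytes on 64-bit Windows and 8 bytes on 64-bit
# Linux and macOS, so the same "word" can mean different precisions.
print(f"C long: {c_long_bytes * 8} bits, C long long: {c_long_long_bytes * 8} bits")
```

A name such as `int64` avoids this ambiguity because the width is part of the name.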
A Ruff plugin for NumPy 2.0 migration rules
Ruff is a new Python linter that can outperform other well-known linting tools. During one of the discussions on the Scientific Python Discord server, we came up with the idea of writing a dedicated Ruff rule to address NumPy 2.0 changes. The rule offers an automated way to fix a large part of the changes: migrating to retained aliases or flagging lines that require manual intervention to be compatible with NumPy 2.0.
This assignment gave me the opportunity to write a bit of Rust code, since it is the core language for Ruff. As of today, the PR is merged, but the rule is still available in the "preview" mode only. In my local setup it correctly fixed over a hundred lines of code in the source of the latest SciPy release.
The command for running the Ruff linter:
    ruff check scipy/ --no-cache --fix
And here's a small passage from `git diff`:
    - a = toarray(a, dtype=np.float_)
    + a = toarray(a, dtype=np.float64)
    ...
    - elif (b == Inf and a == -Inf):
    + elif (b == np.inf and a == -np.inf):
    ...
    - math_dtypes = [np.int_, np.float_, np.complex_]
    + math_dtypes = [np.int_, np.float64, np.complex128]
Ruff rules offer a more robust way to apply systematic changes to large codebases than regular expression searches, because they analyze the AST rather than searching raw text.
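The difference between the two approaches can be illustrated with Python's own `ast` module. This is a simplified sketch of the idea, not the Ruff rule itself (which is written in Rust): the `REPLACEMENTS` table is a hypothetical subset of the renames, and the code assumes NumPy is imported under the alias `np`, whereas a real linter resolves imports properly.

```python
import ast
import re

SOURCE = '''
import numpy as np

x = np.float_(1.0)          # real usage: should be flagged
note = "np.float_ is gone"  # string literal: should not be flagged
'''

# Hypothetical subset of the renames discussed in this post.
REPLACEMENTS = {"float_": "float64", "complex_": "complex128"}


def find_removed_aliases(source):
    """Return (line, old, new) for each `np.<alias>` attribute access."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "np"
            and node.attr in REPLACEMENTS
        ):
            new_name = f"np.{REPLACEMENTS[node.attr]}"
            hits.append((node.lineno, f"np.{node.attr}", new_name))
    return hits


ast_hits = find_removed_aliases(SOURCE)
regex_hits = re.findall(r"np\.float_", SOURCE)

print(ast_hits)         # only the real attribute access is reported
print(len(regex_hits))  # the text search also matches the string literal
```

The AST walk reports only the real attribute access, while the regular expression cannot tell code from the contents of a string.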
Wrapping up
My development and maintenance efforts on NumPy will continue, most notably around Array API Standard support, where full compatibility is still being implemented. The release of NumPy 2.0 is planned for early next year, and only then will the official adoption of the new, major version begin.
Acknowledgements
I would like to thank my mentors Ralf Gommers and Nathan Goldbaum for their advice and guidance during the whole internship, Melissa Weber Mendonça for organising and conducting intern cohort meetings, and Sebastian Berg for PR reviews and explaining NumPy internals. The time spent on NEP 52 was a perfect primer to the Scientific Python ecosystem!
I look forward to continuing to work on NumPy and other libraries within the community!