Using Hypothesis to test array-consuming libraries

[Image: Hypothesis logo accompanied by the text "Property-based testing for the Array API"]

Over the summer, I've been interning at Quansight Labs to develop testing tools for the developers and users of the upcoming Array API standard. Specifically, I contributed "strategies" to the testing library Hypothesis, which I'm excited to announce are now available in hypothesis.extra.array_api. Check out the primary pull request I made for more background.

This blog post is for anyone who develops array-consuming methods (think SciPy and scikit-learn) and is new to property-based testing. I demonstrate a typical workflow of testing with Hypothesis whilst writing an array-consuming function that works for all libraries adopting the Array API, catching bugs before your users do.

Before we begin

Hypothesis shipped with its Array API strategies in version 6.21. We also need to use NumPy >= 1.22 so that we can test with its recently merged Array API implementation—this hasn't been released just yet, so I would recommend installing a nightly build.

I will be using the excellent ipytest extension to nicely run tests in Jupyter as if we were using pytest proper. For pretty printing I use the superb Rich library, simply overriding Python's built-in print with rich.print. I also suppress all warnings for convenience's sake.

In [1]:
%%capture
!pip install "hypothesis>=6.21"
!pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy
In [2]:
%%capture
!pip install ipytest
import ipytest; ipytest.autoconfig(display_columns=80)
In [3]:
%%capture
!pip install rich
from rich import print
In [4]:
import warnings; warnings.filterwarnings("ignore")

What the Array API enables

The API standardises functionality of array libraries, which has numerous benefits for both developers and users. I recommend reading the Data APIs announcement post to get a better idea of how the API is being shaped, but for our purposes it works an awful lot like NumPy.

The most exciting prospect for me is being able to easily write an array-consuming method that works with all the adopting libraries. Let's try writing this method to calculate the cumulative sums of an array:

In [5]:
def cumulative_sums(x):
    """Return the cumulative sums of the elements of the input."""
    xp = x.__array_namespace__()
    
    result = xp.empty(x.size, dtype=x.dtype)
    result[0] = x[0]
    for i in range(1, x.size):
        result[i] = result[i - 1] + x[i]
        
    return result

The all-important __array_namespace__() method allows array-consuming methods to get the array's respective Array API module. Conventionally we assign it to the variable xp.

From there you just need to rely on the guarantees of the Array API to support NumPy, TensorFlow, PyTorch, CuPy, etc. all in one simple method!

Good ol' unit tests

I hope you'd want to write some tests at some point 😉

We can import NumPy's Array API implementation and test with that for now, although in the future it'd be a good idea to try other implementations too (see the related Hypothesis issue). We don't import numpy as np; instead we import NumPy's new numpy.array_api module, which exists to comply with the Array API standard where numpy proper cannot (namely so that NumPy can keep backwards compatibility).

In [6]:
from numpy import array_api as nxp

def test_cumulative_sums():
    x = nxp.asarray([0, 1, 2, 3, 4])
    assert nxp.all(cumulative_sums(x) == nxp.asarray([0, 1, 3, 6, 10]))
    
ipytest.run()
.                                                                        [100%]
1 passed in 0.02s

I would probably write a parametrized test here, with cases covering all the interesting scenarios I can think of.
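For example, a parametrized version might look like this. This is just a sketch, and the cases are only the ones I happen to think of:

import pytest

@pytest.mark.parametrize(
    "elements, expected",
    [
        ([0, 1, 2, 3, 4], [0, 1, 3, 6, 10]),  # the happy path from above
        ([7], [7]),                           # a single element
        ([1, -1, 1], [1, 0, 1]),              # negative values
    ],
)
def test_cumulative_sums_cases(elements, expected):
    x = nxp.asarray(elements)
    assert nxp.all(cumulative_sums(x) == nxp.asarray(expected))

Whatever cases we come up with, we will definitely miss some edge cases. What if we could catch bugs we would never think of ourselves?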

Testing our assumptions with Hypothesis

Hypothesis is a property-based testing library. To lift from their excellent docs, think of a normal unit test as being something like the following:

  1. Set up some data.
  2. Perform some operations on the data.
  3. Assert something about the result.

Hypothesis lets you write tests which instead look like this:

  1. For all data matching some specification.
  2. Perform some operations on the data.
  3. Assert something about the result.

You almost certainly will find new bugs with Hypothesis thanks to how it cleverly fuzzes your specifications, but the package really shines in how it "reduces" failing test cases to present only the minimal reproducers that trigger said bugs. This demo will showcase both its power and user-friendliness.
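To make those three steps concrete, here is a minimal Hypothesis test that has nothing to do with arrays yet. It is only a sketch using the core strategies module:

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))      # 1. for all lists of integers...
def test_sorting_is_idempotent(xs):
    once = sorted(xs)                # 2. perform some operations
    assert sorted(once) == once      # 3. assert something about the result

If this property were ever false, Hypothesis would not just report a failure, it would also shrink the offending list down to a minimal counterexample.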

Let's try testing a simple assumption that we can make about our cumulative_sums() method:

For an array with positive elements, its cumulative sums should only increment or remain the same per step.

We can write a simple enough Hypothesis-powered test method for this:

In [7]:
from hypothesis import given
from hypothesis.extra.array_api import make_strategies_namespace

xps = make_strategies_namespace(nxp)

@given(xps.arrays(dtype="uint8", shape=10))
def test_positive_arrays_have_incrementing_sums(x):
    a = cumulative_sums(x)
    assert nxp.all(a[1:] >= a[:-1])

As the Array API tools provided by Hypothesis are agnostic to the adopting array/tensor libraries, we first need to bind an implementation via make_strategies_namespace(). Passing numpy.array_api will give us a SimpleNamespace to use these tools for NumPy's Array API implementation.

The @given() decorator tells Hypothesis what values it should generate for our test method. In this case xps.arrays() is a "search strategy" that specifies Array API-compliant arrays from numpy.array_api should be generated.

Here, shape=10 specifies that generated arrays are 1-dimensional and of size 10, and dtype="uint8" specifies they should contain unsigned integers (handy for our test method, as uints are never negative). Let's quickly see a small sample of the arrays Hypothesis can generate:

In [8]:
for _ in range(10):
    x = xps.arrays(dtype="uint8", shape=10, unique=True).example()
    print(repr(x))
print("...")
Array([239, 211, 226, 129,  31,  13,  80, 235, 254, 163], dtype=uint8)
Array([164, 175, 254, 111,  63, 241,  64, 201, 173, 117], dtype=uint8)
Array([106, 149, 210, 230,  58,  37,  66, 153, 203, 181], dtype=uint8)
Array([ 93,   0, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
Array([ 16,   0, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
Array([172,   0, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
Array([129,   0, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
Array([111,   0, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
Array([ 67,   0, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
Array([  0, 255, 254, 253, 252, 251, 250, 249, 248, 247], dtype=uint8)
...

How Hypothesis "draws" from its strategies can look rather unremarkable at first. A small sample of draws might look fairly uniform but trust that strategies will end up covering all kinds of edge cases. Importantly it will cover these cases efficiently so that Hypothesis-powered tests are relatively quick to run on your machine.

All our test method does is get the cumulative sums array a that is returned from cumulative_sums(x), and then check that every element a[i] is greater than or equal to a[i-1].

Time to run it!

In [9]:
ipytest.run("-k positive_arrays_have_incrementing_sums", "--hypothesis-seed=3")
F                                                                        [100%]
=================================== FAILURES ===================================
_________________ test_positive_arrays_have_incrementing_sums __________________

    @given(xps.arrays(dtype="uint8", shape=10))
>   def test_positive_arrays_have_incrementing_sums(x):

<cell>:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

x = Array([26, 26, 26, 26, 26, 26, 26, 26, 26, 26], dtype=uint8)

    @given(xps.arrays(dtype="uint8", shape=10))
    def test_positive_arrays_have_incrementing_sums(x):
        a = cumulative_sums(x)
>       assert nxp.all(a[1:] >= a[:-1])
E       assert Array(False, dtype=bool)
E        +  where Array(False, dtype=bool) = <function all at 0x7f2d48cc2430>(Array([ 52,  78, 104, 130, 156, 182, 208, 234,   4], dtype=uint8) >= Array([ 26,  52,  78, 104, 130, 156, 182, 208, 234], dtype=uint8))
E        +    where <function all at 0x7f2d48cc2430> = nxp.all

<cell>:9: AssertionError
---------------------------------- Hypothesis ----------------------------------
Falsifying example: test_positive_arrays_have_incrementing_sums(
    x=Array([26, 26, 26, 26, 26, 26, 26, 26, 26, 26], dtype=uint8),
)
=========================== short test summary info ============================
FAILED <cell>::test_positive_arrays_have_incrementing_sums - assert A...
1 failed, 1 deselected in 0.17s

Hypothesis has tested our assumption and told us we're wrong. It provides us with the following falsifying example:

>>> x = xp.full(10, 26, dtype=xp.uint8)
>>> x
Array([ 26,  26,  26,  26,  26,  26,  26,  26,  26,  26], dtype=uint8)
>>> cumulative_sums(x)
Array([ 26,  52,  78, 104, 130, 156, 182, 208, 234,   4], dtype=uint8)

You can see that an overflow has occurred for the final cumulative sum, as 234 + 26 (i.e. 260) cannot be represented in 8-bit unsigned integers.
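Concretely, uint8 arithmetic wraps around modulo 2**8, which we can verify with plain Python integers:

>>> (234 + 26) % 2**8
4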

Let's try promoting the dtype of the cumulative sums array so that it can represent larger numbers, and then we can run the test again.

In [10]:
def max_dtype(xp, dtype):
    """Return the largest dtype of the same kind (signed, unsigned, or float)."""
    if dtype in [getattr(xp, name) for name in ("int8", "int16", "int32", "int64")]:
        return xp.int64
    elif dtype in [getattr(xp, name) for name in ("uint8", "uint16", "uint32", "uint64")]:
        return xp.uint64
    else:
        return xp.float64

def cumulative_sums(x):
    xp = x.__array_namespace__()
    
    result = xp.empty(x.size, dtype=max_dtype(xp, x.dtype))
    result[0] = x[0]
    for i in range(1, x.size):
        result[i] = result[i - 1] + x[i]
        
    return result

ipytest.run("-k positive_arrays_have_incrementing_sums")
.                                                                        [100%]
1 passed, 1 deselected in 0.18s

Another assumption we're making in our code is:

We can find the cumulative sums of arrays of any scalar dtype.

We should cover this assumption in our test method test_positive_arrays_have_incrementing_sums by passing child search strategies into our xps.arrays() parent strategy. Specifying dtype as xps.scalar_dtypes() will tell Hypothesis to generate arrays of all scalar dtypes. To specify that these array values should be positive, we can just pass keyword arguments to the underlying value generating strategy xps.from_dtype() via elements={"min_value": 0}.

And while we're at it, let's make sure to cover another assumption:

We can find the cumulative sums of arrays with multiple dimensions.

Specifying shape as xps.array_shapes() will tell Hypothesis to generate arrays of various dimensionalities and sizes. We can filter this strategy with lambda s: prod(s) > 1 so that x.size > 1 always holds, which keeps our test code working.

In [11]:
from math import prod

@given(
    xps.arrays(
        dtype=xps.scalar_dtypes(),
        shape=xps.array_shapes().filter(lambda s: prod(s) > 1),
        elements={"min_value": 0},
    )
)
def test_positive_arrays_have_incrementing_sums(x):
    a = cumulative_sums(x)
    assert nxp.all(a[1:] >= a[:-1])
    
ipytest.run("-k positive_arrays_have_incrementing_sums", "--hypothesis-seed=3")
F                                                                        [100%]
=================================== FAILURES ===================================
_________________ test_positive_arrays_have_incrementing_sums __________________

    @given(
>       xps.arrays(
            dtype=xps.scalar_dtypes(),
            shape=xps.array_shapes().filter(lambda s: prod(s) > 1),
            elements={"min_value": 0},
        )
    )
E   hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.

<cell>:5: MultipleFailures
---------------------------------- Hypothesis ----------------------------------
Falsifying example: test_positive_arrays_have_incrementing_sums(
    x=Array([[False, False]], dtype=bool),
)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  <cell>, line 12, in test_positive_arrays_have_incrementing_sums
    a = cumulative_sums(x)
  <cell>, line 13, in cumulative_sums
    result[0] = x[0]
  File "<env>/numpy/array_api/_array_object.py", line 657, in __setitem__
    self._array.__setitem__(key, asarray(value)._array)
ValueError: setting an array element with a sequence.

Falsifying example: test_positive_arrays_have_incrementing_sums(
    x=Array([False, False], dtype=bool),
)
Traceback (most recent call last):
  <cell>, line 12, in test_positive_arrays_have_incrementing_sums
    a = cumulative_sums(x)
  <cell>, line 15, in cumulative_sums
    result[i] = result[i - 1] + x[i]
  File "<env>/numpy/array_api/_array_object.py", line 362, in __add__
    other = self._check_allowed_dtypes(other, "numeric", "__add__")
  File "<env>/numpy/array_api/_array_object.py", line 125, in _check_allowed_dtypes
    raise TypeError(f"Only {dtype_category} dtypes are allowed in {op}")
TypeError: Only numeric dtypes are allowed in __add__
=========================== short test summary info ============================
FAILED <cell>::test_positive_arrays_have_incrementing_sums - hypothes...
1 failed, 1 deselected in 0.38s

Again Hypothesis has proved our assumptions wrong, and this time it's found two problems.

Firstly, our cumulative_sums() method doesn't adjust for boolean arrays, so we get an error when we add two bool values together.

>>> x = xp.zeros(2, dtype=xp.bool)
>>> x
Array([False, False], dtype=bool)
>>> cumulative_sums(x)
Traceback:
  <cell>, line 15, in cumulative_sums
    result[i] = result[i - 1] + x[i]
  ...
TypeError: Only numeric dtypes are allowed in __add__

Secondly, our cumulative_sums() method assumes arrays are 1-dimensional, so we get an error when we wrongly assume x[0] will always return a single scalar (technically a 0-dimensional array). For a 2-dimensional array, x[0] actually returns its whole first row, which cannot be assigned to the scalar slot result[0].

>>> x = xp.zeros((1, 2), dtype=xp.bool)
>>> x
Array([[False, False]], dtype=bool)
>>> cumulative_sums(x)
Traceback:
  <cell>, line 13, in cumulative_sums
    result[0] = x[0]
  ...
TypeError: only size-1 arrays can be converted to Python scalars

I'm going to flatten input arrays and convert boolean arrays to integer arrays of ones and zeros. Of course, we'll run the test again to make sure our updated cumulative_sums() method now works.

In [12]:
def cumulative_sums(x):
    xp = x.__array_namespace__()
    
    x = xp.reshape(x, x.size)
    
    if x.dtype == xp.bool:
        # Represent booleans as 0s and 1s so they can be summed
        mask = x
        x = xp.zeros(x.shape, dtype=xp.uint64)
        x[mask] = 1
        
    result = xp.empty(x.size, dtype=max_dtype(xp, x.dtype))
    result[0] = x[0]
    for i in range(1, x.size):
        result[i] = result[i - 1] + x[i]
        
    return result

ipytest.run("-k positive_arrays_have_incrementing_sums", "--hypothesis-seed=3")
F                                                                        [100%]
=================================== FAILURES ===================================
_________________ test_positive_arrays_have_incrementing_sums __________________

    @given(
>       xps.arrays(
            dtype=xps.scalar_dtypes(),
            shape=xps.array_shapes().filter(lambda s: prod(s) > 1),
            elements={"min_value": 0},
        )
    )

<cell>:5: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

x = Array([4611686018427387904, 4611686018427387904], dtype=int64)

    @given(
        xps.arrays(
            dtype=xps.scalar_dtypes(),
            shape=xps.array_shapes().filter(lambda s: prod(s) > 1),
            elements={"min_value": 0},
        )
    )
    def test_positive_arrays_have_incrementing_sums(x):
        a = cumulative_sums(x)
>       assert nxp.all(a[1:] >= a[:-1])
E       assert Array(False, dtype=bool)
E        +  where Array(False, dtype=bool) = <function all at 0x7f2d48cc2430>(Array([-9223372036854775808], dtype=int64) >= Array([4611686018427387904], dtype=int64))
E        +    where <function all at 0x7f2d48cc2430> = nxp.all

<cell>:13: AssertionError
---------------------------------- Hypothesis ----------------------------------
Falsifying example: test_positive_arrays_have_incrementing_sums(
    x=Array([4611686018427387904, 4611686018427387904], dtype=int64),
)
=========================== short test summary info ============================
FAILED <cell>::test_positive_arrays_have_incrementing_sums - assert A...
1 failed, 1 deselected in 1.24s

We resolved our two previous issues... but Hypothesis has found yet another failing scenario 🙃

>>> x = xp.full(2, 4611686018427387904, dtype=xp.int64)
>>> x
Array([ 4611686018427387904,  4611686018427387904], dtype=int64)
>>> cumulative_sums(x)
Array([ 4611686018427387904, -9223372036854775808], dtype=int64)
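Before reacting, let's verify what happened with plain Python integers: that repeated value is exactly 2**62, so the true sum is 2**63, one past the maximum value an int64 can hold, and it wraps around to -2**63.

>>> 4611686018427387904 == 2**62
True
>>> 2**62 + 2**62 == 2**63
True
>>> 2**63 - 1  # the int64 maximum
9223372036854775807
>>> -2**63     # what the sum wrapped around to
-9223372036854775808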

An overflow has occurred again, and this time we can't do much about it. There's no larger signed integer dtype than int64 (in the Array API), so we'll just have cumulative_sums() detect overflows itself.

In [13]:
def cumulative_sums(x):
    xp = x.__array_namespace__()
    
    x = xp.reshape(x, x.size)
    
    if x.dtype == xp.bool:
        # Represent booleans as 0s and 1s so they can be summed
        mask = x
        x = xp.zeros(x.shape, dtype=xp.uint64)
        x[mask] = 1
        
    result = xp.empty(x.size, dtype=max_dtype(xp, x.dtype))
    result[0] = x[0]
    for i in range(1, x.size):
        result[i] = result[i - 1] + x[i]
        if result[i] < result[i - 1]:
            raise OverflowError("Cumulative sum cannot be represented")
        
    return result

If Hypothesis generates arrays which make cumulative_sums() raise OverflowError, we can just catch it and use assume(False) to skip testing those arrays at runtime. This "filter-on-runtime" behaviour can be very handy at times, although the Hypothesis docs note that assume() can be problematic.

We can also explicitly cover overflows in a separate test.

In [14]:
from hypothesis import assume
import pytest

@given(
    xps.arrays(
        dtype=xps.scalar_dtypes(),
        shape=xps.array_shapes().filter(lambda s: prod(s) > 1),
        elements={"min_value": 0},
    )
)
def test_positive_arrays_have_incrementing_sums(x):
    try:
        a = cumulative_sums(x)
        assert nxp.all(a[1:] >= a[:-1])
    except OverflowError:
        assume(False)
    
def test_error_on_overflow():
    x = nxp.asarray([nxp.iinfo(nxp.uint64).max, 1], dtype=nxp.uint64)
    with pytest.raises(OverflowError):
        cumulative_sums(x)

ipytest.run()
...                                                                      [100%]
3 passed in 0.27s

Our little test suite finally passes 😅

If you're feeling adventurous, you might want to get this very notebook running and see if you can write some test cases yourself—bonus points if they fail! For starters, how about testing that cumulative sums decrease with arrays containing negative elements?
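Here is one possible starting point. It is only a sketch reusing the strategies from earlier, and whether it passes is part of the exercise (hint: think about how our overflow check decides that a sum has overflowed):

@given(
    xps.arrays(
        dtype=xps.integer_dtypes(),
        shape=xps.array_shapes().filter(lambda s: prod(s) > 1),
        elements={"max_value": 0},
    )
)
def test_negative_arrays_have_decrementing_sums(x):
    a = cumulative_sums(x)
    assert nxp.all(a[1:] <= a[:-1])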

When you're developing an Array API array-consuming method, and an equivalent method already exists for one of the adopting libraries, I highly recommend using Hypothesis to compare its results to your own. For example, we could use the battle-tested np.cumsum() to see how our cumulative_sums() method compares:

In [15]:
import numpy as np

@given(xps.arrays(dtype=xps.scalar_dtypes(), shape=xps.array_shapes()))
def test_reference_implementation(x):
    our_out = cumulative_sums(x)
    # We convert numpy.array_api arrays into NumPy's top-level "ndarray"
    # structure, so we can compare our results with np.cumsum().
    # We do this via np.asarray(obj), which will see if obj supports
    # NumPy's interface protocol to subsequently get the underlying
    # ndarray from obj - fortunately arrays generated from
    # numpy.array_api do support this!
    # See https://numpy.org/devdocs/user/basics.interoperability.html
    our_out = np.asarray(our_out)
    their_out = np.cumsum(np.asarray(x))
    assert np.all(our_out == their_out)

Such "differential testing" is a great exercise to really think about what your code does, even if ultimately you conclude that you are happy with different results.

Zac Hatfield-Dodds, who maintains Hypothesis, writes more about differential testing in their short paper "Falsify your Software: validating scientific code with property-based testing". It's a great read if you want more ideas on how Hypothesis can make your array-consuming libraries robust!

Watch this space

This year should see a first version of the Array API standard, and subsequently NumPy shipping numpy.array_api out to the world—this means array-consuming libraries will be able to reliably develop for the Array API quite soon. I hope I've demonstrated why you should try Hypothesis when the time comes 🙂

The good news is that I'm extending my stay at Quansight. My job is to help unify Python's fragmented scientific ecosystem, so I'm more than happy to respond to any inquiries about using Hypothesis for the Array API via email or Twitter.

For now I'm contributing to the Hypothesis-powered Array API compliance suite, which is already being used by the NumPy team to ensure numpy.array_api actually complies with every tiny detail of the specification. This process has the added side-effect of finding limitations in hypothesis.extra.array_api, so you can expect Hypothesis to only improve from here on out!
