Read Our Blog

A vision for extensibility to GPU & distributed support for SciPy, scikit-learn, scikit-image and beyond

Over the years, array computing in Python has evolved to support distributed arrays, GPU arrays, and various other kinds of arrays that work with specialized hardware, carry additional metadata, or use different internal memory representations. The foundational library for array computing in the PyData ecosystem is NumPy. But NumPy alone is a CPU-only library - and a single-threaded one at that - and in a world where it's possible to get a GPU, or a CPU with a large core count, in the cloud cheaply or even for free in a matter of seconds, that may no longer seem like enough. For the past couple of years, a lot of thought and effort has gone into devising mechanisms to tackle this problem and to evolve the ecosystem gradually towards a state where PyData libraries can run on a GPU, as well as in distributed mode across multiple GPUs.
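
As a sketch of where this is heading: array-consuming code can be written against the array API standard, so the same function runs on whichever array type it receives. The `array-api-compat` helper used below is one assumed way to obtain the right namespace; this is an illustration, not the libraries' settled implementation.

```python
# A minimal sketch of array-agnostic code via the array API standard.
# Assumes the array-api-compat package is installed; NumPy, CuPy and
# PyTorch arrays can all flow through the same function unchanged.
from array_api_compat import array_namespace

def standardize(x):
    # Dispatch to the namespace of whatever array was passed in, so the
    # same code runs on CPU (NumPy) or GPU (CuPy, PyTorch) arrays.
    xp = array_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)
```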

We feel that a shared vision has emerged, in bits and pieces. In this post, we aim to articulate that vision and suggest a path to making it concrete, focusing on three libraries at the core of the PyData ecosystem: SciPy, scikit-learn, and scikit-image. We are also happy to share that AMD has recognized the value of this vision and is partnering with Quansight Labs to help make it a reality.

Read more…

NumPy Benchmarking

In this blog post, I'll be talking about my journey at Quansight. I want to share everything I was involved in and accomplished, the issues I faced, and, most importantly, the awesome life hacks I learned along the way.

First of all, I'd like to express my gratitude to everyone at Quansight for allowing me to be part of such a great team. My work was mainly focused on providing performance benchmarks for NumPy in realistic situations. The goal was to show the world that NumPy is efficient at handling quasi-real-life workloads too.

The primary technical outcome of my work is available in the NumPy documentation.
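
To give a flavour of the approach, here is a small, illustrative timing sketch in the spirit of those benchmarks; the workload and sizes here are mine for illustration, not the published benchmarks.

```python
# Timing a vectorized NumPy operation against a pure-Python loop with
# timeit; illustrative only, not one of the published benchmarks.
import timeit
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random(100_000), rng.random(100_000)

vectorized = timeit.timeit(lambda: a * b, number=100)
looped = timeit.timeit(lambda: [x * y for x, y in zip(a, b)], number=10)
print(f"vectorized: {vectorized / 100:.6f} s/run, loop: {looped / 10:.6f} s/run")
```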

Image: a word cloud of the themes, open-source projects, and people mentioned throughout the post, each stylized in a different, mostly calligraphic font.

Read more…

An efficient method of calling C++ functions from Numba using clang++/ctypes/rbc

The aim of this post is to explore a method of calling C++ library functions from Numba-compiled functions, that is, Python functions decorated with numba.jit(nopython=True).

While there are ways to wrap C++ code for Python (see the Appendix below), calling these wrappers from Numba-compiled functions is often not as straightforward or efficient as one would hope.
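
To set the stage, here is a minimal sketch of the ctypes route that the post builds on. The shared library and function names below are hypothetical, and Numba's documented support for calling ctypes functions in nopython mode does the heavy lifting.

```python
# Sketch: call a C++ function from a Numba nopython function via ctypes.
# Assumes a shared library built from C++ with an extern "C" symbol, e.g.:
#   clang++ -shared -fPIC -o libmath.so mymath.cpp
# The library and function names here are hypothetical.
import ctypes
import numba

lib = ctypes.CDLL("./libmath.so")
lib.cpp_add.restype = ctypes.c_double
lib.cpp_add.argtypes = (ctypes.c_double, ctypes.c_double)
cpp_add = lib.cpp_add  # Numba can call ctypes functions in nopython mode

@numba.jit(nopython=True)
def use_cpp_add(x, y):
    return cpp_add(x, y)

print(use_cpp_add(1.0, 2.0))  # 3.0
```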

Read more…

Array Libraries Interoperability

In this blog post, I talk about the work I accomplished during my internship at Quansight Labs and the ongoing efforts to make array libraries more interoperable.

In what follows, I'll assume a basic understanding of array and tensor libraries and their usage in the Python scientific computing and data science software stack.
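
As one concrete example of the interoperability mechanisms discussed, here is a minimal sketch of zero-copy data exchange through the DLPack protocol. PyTorch is used purely for illustration; CuPy and JAX expose the same protocol.

```python
# Zero-copy exchange between array libraries via DLPack.
# Assumes NumPy >= 1.22 and a recent PyTorch; illustration only.
import numpy as np
import torch

x = np.arange(6.0).reshape(2, 3)
t = torch.from_dlpack(x)   # shares memory with the NumPy array
x[0, 0] = 99.0
print(t[0, 0])             # tensor(99.) - the change is visible in both
```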

Image: a meme of Master Splinter leading the baby turtles from TMNT, where Splinter represents NumPy and the turtles represent TensorFlow, CuPy, PyTorch, and JAX. Caption: Master NumPy leading the young Tensor Turtles.

Read more…

Re-Engineering CI/CD pipelines for SciPy

In this blog post, I talk about the projects and my work during my internship at Quansight Labs. My efforts were geared towards re-engineering the CI/CD pipelines for SciPy to make them more efficient on GitHub Actions. I also cover the milestones I reached, along with what I learned and improved along the way.

This blog post assumes a basic understanding of CI/CD and GitHub Actions, as well as a basic familiarity with Python and the SciPy ecosystem.

Image: the Quansight Labs logo on the left and the SciPy logo on the right, signifying the project's primary purpose: re-engineering the GitHub Actions CI for SciPy and laying out the further scope for a full CI matrix covering build, test, and release.

Read more…

Using Hypothesis to test array-consuming libraries

Image: the Hypothesis logo accompanied by the text "Property-based testing for the Array API".

Over the summer, I've been interning at Quansight Labs to develop testing tools for the developers and users of the upcoming Array API standard. Specifically, I contributed "strategies" to the testing library Hypothesis, which I'm excited to announce are now available in hypothesis.extra.array_api. Check out the primary pull request I made for more background.

This blog post is for anyone who develops array-consuming methods (think SciPy and scikit-learn) and is new to property-based testing. I demonstrate a typical testing workflow with Hypothesis whilst writing an array-consuming function that works for all libraries adopting the Array API, catching bugs before your users do.
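
Below is a minimal sketch of such a property-based test using the new strategies; numpy.array_api (NumPy >= 1.22, experimental) stands in for any namespace adopting the standard.

```python
# Property-based test using the Array API strategies in Hypothesis.
# numpy.array_api is used for illustration; any conforming namespace works.
from hypothesis import given
from hypothesis.extra.array_api import make_strategies_namespace
import numpy.array_api as xp

xps = make_strategies_namespace(xp)

@given(xps.arrays(dtype=xps.floating_dtypes(), shape=xps.array_shapes()))
def test_abs_preserves_shape(x):
    # A simple property that must hold for every generated array.
    assert xp.abs(x).shape == x.shape
```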

Read more…

Dataframe interchange protocol and Vaex

In this blog post, I briefly describe the work I did over my three-month internship at Quansight Labs: implementing the dataframe interchange protocol in Vaex.

The dataframe interchange protocol will enable data interchange between different dataframe libraries, for example cuDF, Vaex, Koalas, pandas, etc. Of these, Vaex is the library in which the protocol was implemented. Vaex is a high-performance Python library for lazy, out-of-core DataFrames.

Image: connections between dataframe libraries through the dataframe protocol.

About | What is all that?

Today there are quite a number of different dataframe libraries available in Python, and likewise quite a number of, for example, plotting libraries. In most cases these accept only pandas dataframes, so users are often forced to convert between dataframes in order to use the functionality of a specific plotting library. It would be extremely cool to be able to use plotting libraries on any kind of dataframe, would it not?
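
That is exactly what the interchange protocol enables: a consumer can accept any dataframe that exposes it. Below is a minimal sketch, assuming pandas >= 1.5 (which ships a consumer, pandas.api.interchange.from_dataframe) and Vaex installed; the data is illustrative.

```python
# Consume a Vaex dataframe through the interchange protocol.
# pandas >= 1.5 provides from_dataframe; the data here is illustrative.
import numpy as np
import pandas as pd
import vaex

vdf = vaex.from_arrays(x=np.array([1, 2, 3]), y=np.array([4.0, 5.0, 6.0]))
pdf = pd.api.interchange.from_dataframe(vdf)  # no manual conversion needed
print(pdf)
```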

Read more…

Low-code contributions through GitHub

Healthy, inclusive communities are critical to impactful open source projects. A challenge for established projects is that their history and implicit technical debt raise the barrier to contributing to significant portions of the code base. Literacy in a large code base develops over time through incremental contributions, and we'll discuss a format that can help people begin this journey.

At Quansight Labs, we are motivated to provide opportunities for new contributors to experience open source community work regardless of their software literacy. Community workshops are a common format for onboarding, but sometimes the outcome can be less than satisfactory for participants and organizers. In these workshops, there are implicit challenges that participants must overcome before they can contribute, such as learning a revision-control tool like Git or setting up a development environment.

Our goal with the following low-code workshop is to offer a way for folks to join a project's contributors list without the technical overhead. To achieve this we'll discuss a format that relies solely on the GitHub web interface.

Read more…

Not a checklist: different accessibility needs in JupyterLab

JupyterLab Accessibility Journey Part 3

In a pandemic, the template joke-starter “x and y walk into a bar” seems like a stretch from my reality. So let’s try this remote version:

Two community members with accessibility knowledge enter a virtual meeting room to talk about JupyterLab. They’ve both updated themselves on GitHub issues ahead of time. They’ve both identified major problems with the interface. They both get ready to express to the rest of the community what is indisputably, one hundred percent for-sure the biggest accessibility blocker in JupyterLab for users. Here it is, the moment of truth!

And they each say totally different things.

Read more…