uarray: Attempting to move the ecosystem forward

There comes a time in every project when most technological hurdles have been overcome and its adoption becomes a social problem. I believe uarray and unumpy reached that state a month ago.

I then proceeded, along with Ralf Gommers and Peter Bell, to write NumPy Enhancement Proposal 31, or NEP-31. It generated a lot of excellent feedback on the structure and the nuances of the proposal, which you can read both on the pull request and in the mailing list discussion. That feedback led to a lot of restructuring of the NEP's contents, but to very few changes in the actual proposal. I take full responsibility for this: I have a bad tendency to assume everyone knows what I'm thinking. Thankfully, I'm not alone in this: it's a known psychological phenomenon.

Read more…

Quansight Labs Work Update for September, 2019

As of November, 2018, I have been working at Quansight. Quansight is a new startup, founded by the same people who started Anaconda, that aims to connect companies and open source communities, and offers consulting, training, support and mentoring services. I work under the heading of Quansight Labs. Quansight Labs is a public-benefit division of Quansight. It provides a home for a "PyData Core Team" which consists of developers, community managers, designers, and documentation writers who build open-source technology and grow open-source communities around all aspects of the AI and Data Science workflow.

My work at Quansight is split between doing open source consulting for various companies, and working on SymPy. SymPy, for those who do not know, is a symbolic mathematics library written in pure Python. I am the lead maintainer of SymPy.

In this post, I will detail some of the open source work that I have done recently, both as part of my open source consulting, and as part of my work on SymPy for Quansight Labs.

Bounds Checking in Numba

As part of a client project, I have been contributing code to the Numba project. Numba is a just-in-time compiler for Python. It lets you write native Python code and, with the use of a simple @jit decorator, have that code automatically sped up using LLVM. This can result in code that is up to 1000x faster in some cases:
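As a minimal sketch of what using the decorator looks like (the function `sum_of_squares` is a made-up example, and the `ImportError` fallback is an assumption added only so the snippet still runs as plain Python where Numba is not installed):

```python
try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: without Numba, use a no-op stand-in
    # so the function below runs as ordinary (uncompiled) Python.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(n):
    # A tight numeric loop, the kind of code Numba compiles via LLVM.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(4)  # 0 + 1 + 4 + 9
```

The first call triggers compilation, so any speed-up only shows on later calls; how large it is depends heavily on the code being compiled.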

Read more…

Ruby wrappers for the XND project

Introduction

Lack of stable and reliable scientific computing software has been a persistent problem for the Ruby community, making it hard for enthusiastic Ruby developers to use Ruby in everything from their web applications to their data analysis projects. One of the most important components of any successful scientific software stack is a well maintained and flexible array computation library that can act as a fast and simple way of storing in-memory data and interfacing it with various fast and battle-tested libraries like LAPACK and BLAS.

Various projects have attempted to make such libraries in the past (and some are still thriving and maintained). Some of the notable ones are numo, nmatrix, and more recently, numruby. These projects attempt to provide a simple Ruby-like API for creating and manipulating arrays of various types. All of them are able to easily interface with libraries like ATLAS, FFTW and LAPACK.

However, all of the above projects fall short in two major aspects:

  • Lack of extensibility to adapt to modern use cases (read Machine Learning).
  • Lack of a critical mass of developers to maintain a robust and fast array library.

The first problem is mainly due to the fact that they do not support very robust type systems. The available data types are limited and hard to extend to more complex uses. Modern use cases like Machine Learning require a more robust type system (e.g. one that can describe arrays of arbitrary dimension on multiple devices), as has been demonstrated by the tensor implementations of frameworks like TensorFlow and PyTorch.

The second problem is due to the fact that all of the aforementioned projects are community efforts that are maintained part-time by developers simply out of a sense of purpose and passion. Sustaining such complex projects for extended periods of time without expectation of any support is simply unfeasible even for the most driven engineers.

This is where the XND project comes in. XND aims to build a common library that meets the needs of the various data analysis and machine learning frameworks that have had to build their own array objects and programming languages. It is built on the premise of extending arrays with new types and various device types (CPUs, GPUs etc.) without loss of performance or ease of use.

Read more…

Quansight Labs Dask Update

This post provides an update on some recent Dask-related activities the Quansight Labs team has been working on.

Dask community work order

Through a community work order (CWO) with the D. E. Shaw group, the Quansight Labs team has been able to dedicate developer time towards bug fixes and feature requests for Dask. This work has touched on several portions of the Dask codebase, but has generally centered on using Dask Arrays with the distributed scheduler.

Read more…

Spyder 4.0 beta4: Kite integration is here

Kite is sponsoring the work discussed in this blog post, and in addition supports Spyder 4.0 development through a Quansight Labs Community Work Order.

As part of our next release, we are proud to announce an additional completion client for Spyder: Kite. Kite is a novel completion client that uses Machine Learning techniques to find and predict the best autocompletion for a given text. Additionally, it collects improved documentation for compiled packages (e.g., Matplotlib, NumPy and SciPy) that cannot be obtained easily with traditional code analysis packages such as Jedi.

Read more…

Quansight presence at SciPy'19

Yesterday the SciPy'19 conference ended. It was a lot of fun, and very productive. You can really feel that there's a lot of energy in the community, and that it's growing and maturing. This post is just a quick update to summarize Quansight's presence and contributions, as well as some of the more interesting things I noticed.

A few highlights

The "Open Source Communities" track, which had a strong emphasis on topics like burnout, diversity and sustainability, as well as the keynotes by Stuart Geiger ("The Invisible Work of Maintaining and Sustaining Open-Source Software") and Carol Willing ("Jupyter: Always Open for Learning and Discovery"), showed that many more people and projects are paying attention to, and evolving their thinking on, the human and organizational aspects of open source.

I did not go to many technical talks, but did make sure to catch Matt Rocklin's talk "Refactoring the SciPy Ecosystem for Heterogeneous Computing". Matt clearly explained some key issues and opportunities around the state of array computing libraries in Python - I highly recommend watching this talk.

Abigail Cabunoc Mayes' talk "Work Open, Lead Open (#WOLO) for Sustainability" was fascinating - it made me rethink the governance models and roles we use for our projects, and I worked on some of her concrete suggestions during the sprints.

Read more…

Ibis: Python data analysis productivity framework

Ibis is a data analysis library that provides a pandas-like API. Operations such as creating filters, adding columns and applying math operations are lazy: they are only recorded in memory, not executed. When you ask for the result of the expression you have built, Ibis compiles it and sends a request to the remote server (a remote storage and execution system such as a Hadoop component or a SQL database). Its goal is to simplify analytical workflows and make you more productive.

Ibis was created by Wes McKinney and is mainly maintained by Phillip Cloud and Krisztián Szűcs. Also, recently, I was invited to become a maintainer of the Ibis repository!

Maybe you are thinking: "why should I use Ibis?". Well, if any of the following applies to you, you should probably consider using Ibis in your analytical workflow!

  • if you need to get data from a SQL database but you don't know much about SQL ...
  • if you create SQL statements manually using strings and have a lot of IFs in your code that compose specific parts of your SQL (which can be pretty hard to maintain and makes your code pretty ugly) ...
  • if you need to handle large volumes of data ...
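To make the lazy model and the contrast with hand-assembled SQL strings more concrete, here is a small plain-Python sketch. It is not Ibis's API: the `LazyTable` class, its methods and the `trips` table are hypothetical and exist only to show operations being recorded first and compiled to SQL later.

```python
class LazyTable:
    """Hypothetical lazy expression: records operations, runs nothing."""

    def __init__(self, name, columns=("*",), filters=()):
        self.name = name
        self.columns = tuple(columns)
        self.filters = tuple(filters)

    def select(self, *columns):
        # Return a new expression; the original is left untouched.
        return LazyTable(self.name, columns, self.filters)

    def filter(self, condition):
        # Conditions are raw strings here to keep the sketch short;
        # a real library would use typed expression objects instead.
        return LazyTable(self.name, self.columns, self.filters + (condition,))

    def compile(self):
        # Only at this point is the recorded expression turned into SQL.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql


expr = LazyTable("trips").filter("distance > 10").select("id", "distance")
query = expr.compile()
```

Nothing touches a database until `compile()` is called, and the optional parts of the query are composed by chaining methods rather than by concatenating strings behind IF statements.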

uarray update: API changes, overhead and comparison to __array_function__

uarray is a generic override framework for objects and methods in Python. Since my last uarray blogpost, there have been plenty of developments, changes to the API and improvements to the overhead of the protocol. Let’s begin with a walk-through of the current feature set and API, and then move on to current developments and how it compares to __array_function__. For further details on the API and latest developments, please see the API page for uarray. The examples there are doctested, so they will always be current.

Motivation

Other array objects

NumPy is a simple, rectangular, dense, and in-memory data store. This is great for some applications but isn't complete on its own. It doesn't encompass every single use-case. The following are examples of array objects available today that have different features and cater to a different kind of audience.

  • Dask is one of the most popular ones. It allows distributed and chunked computation.
  • CuPy is another popular one, and allows GPU computation.
  • PyData/Sparse is slowly gaining popularity, and is a sparse, in-memory data store.
  • XArray includes named dimensions.
  • Xnd is another effort to re-write and modernise the NumPy API, and includes support for GPU arrays and ragged arrays.
  • Another effort (although with no Python wrapper, only data marshalling) is xtensor.

Some of these objects can be composed. Namely, Dask both expects and exports the NumPy API, whereas XArray expects the NumPy API. This makes interesting combinations possible, such as distributed sparse or GPU arrays, or even labelled distributed sparse or CPU/GPU arrays.

Also, there are many other libraries (a popular one being scikit-learn) that need a back-end mechanism in order to be able to support different kinds of array objects. Finally, there is a desire to see SciPy itself gain support for other array objects.

__array_function__ and its limitations

One of my motivations for working on uarray was the set of limitations of the __array_function__ protocol, defined in this proposal. The limitations are as follows:

  • It can only dispatch on array objects.
  • Consequently, it can only dispatch on functions that accept array objects.
  • It has no mechanism for conversion and coercion.
  • Since it conflates arrays and backends, only a single backend type per array object is possible.
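The first two limitations can be seen in a short sketch. It assumes NumPy 1.17 or later, where the protocol is enabled by default; `TaggedArray` is a hypothetical class made up for this illustration.

```python
import numpy as np


class TaggedArray:
    """Hypothetical array-like that overrides some NumPy functions."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this only because a TaggedArray appears in `args`.
        if func is np.sum:
            return ("TaggedArray", float(args[0].data.sum()))
        return NotImplemented


# Dispatch works: the function receives an array object to dispatch on.
summed = np.sum(TaggedArray([1, 2, 3]))

# No dispatch is possible: np.zeros receives only a shape, so there is
# no array argument whose __array_function__ could be consulted.
zeros = np.zeros((2, 2))
```

Here `np.sum` is routed to `TaggedArray`, while `np.zeros` always produces a plain `numpy.ndarray` because nothing in its arguments identifies another array type.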

These limitations have been partially discussed before.

uarray — The solution?

With that out of the way, let's explore uarray, a library that hopes to resolve these issues. Even though the original motivation was NumPy and array computing, the library itself is meant to be a generic multiple-dispatch mechanism.

In [1]:
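# Enable __array_function__ for NumPy < 1.17.0. Use %env so the variable is set in the
# kernel itself before NumPy is imported; `!export` would only affect a throwaway subshell.
%env NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1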
In [2]:
import uarray as ua
import numpy as np

In uarray, the fundamental building block is a multimethod. Multimethods have a number of nice properties, such as automatic dispatch based on backends. It is important to note here that multimethods will be written by API authors, rather than implementors. Here's how we define a multimethod in uarray:
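The idea of dispatching on backends can also be sketched without uarray itself. The following is plain Python and not uarray's API: `multimethod`, `use_backend` and `TupleBackend` are hypothetical names invented for this illustration.

```python
import contextlib

_active_backends = []  # the most recently activated backend is last


@contextlib.contextmanager
def use_backend(backend):
    # Activate a backend for the duration of a `with` block.
    _active_backends.append(backend)
    try:
        yield
    finally:
        _active_backends.pop()


def multimethod(default):
    # Written by the API author: it declares the function's name and
    # signature and supplies a default implementation.
    def dispatcher(*args, **kwargs):
        for backend in reversed(_active_backends):
            impl = getattr(backend, default.__name__, None)
            if impl is not None:
                return impl(*args, **kwargs)
        return default(*args, **kwargs)
    return dispatcher


@multimethod
def ones(length):
    return [1.0] * length


class TupleBackend:
    # Written by an implementor: it overrides `ones` by name.
    @staticmethod
    def ones(length):
        return (1.0,) * length


default_result = ones(3)
with use_backend(TupleBackend):
    backend_result = ones(3)
```

Note that `ones` takes no array argument at all: the implementation is chosen by the active backend rather than by the types of the arguments.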

Read more…

Labs update and May highlights

Time flies when you're having fun. Here is an update of some of the highlights of my second month at Quansight Labs.

The making of a black hole image & GitHub Sponsors

Travis and I were both invited by GitHub to attend GitHub Satellite in Berlin. The main reason was that Nat Friedman (GitHub CEO) decided to spend the first 20 minutes of his keynote highlighting the Event Horizon Telescope's black hole image and the open source software that made that imaging possible. The scientific Python stack featured very prominently: NumPy, Matplotlib, Python, Cython, SciPy, AstroPy and other projects were highlighted. At the same time, Nat introduced new GitHub features like "used by", a triaging role and new dependency graph features, and illustrated how those worked for NumPy. These features will be very welcome news to maintainers of almost any project.

GitHub Satellite'19 keynote, showcasing NumPy and Matplotlib

The single most visible feature introduced was GitHub Sponsors:

GitHub Sponsors enabled on the NumPy repo

I really enjoyed meeting Devon Zuegel, Product Manager of the Open Source Economy Team at GitHub, in person after previously having had the chance to exchange ideas with her about the funding related needs of scientific Python projects and their core teams. I'm confident that GitHub Sponsors will evolve in a direction that's beneficial to community-driven open source projects.

Read more…

TDK-Micronas partners with Quansight to sponsor Spyder

TDK-Micronas is sponsoring Spyder development efforts through Quansight Labs. This will enable the development of some features that have been requested by our users, as well as new features that will help TDK develop custom Spyder plugins to complement their Automatic Test Equipment (ATE) in the development of their Application Specific Integrated Circuits (ASICs).

At this point it may be useful to clarify the role of Quansight Labs in Spyder's development and its relationship with TDK. To quote Ralf Gommers (director of Quansight Labs):

"We're an R&D lab for open source development of core technologies around data science and scientific computing in Python. And focused on growing communities around those technologies. That's how I see it for Spyder as well: Quansight Labs enables developers to be employed to work on Spyder, and helps with connecting them to developers of other projects in similar situations. Labs should be an enabler to let the Spyder project, its community and individual developers grow. And Labs provides mechanisms to attract and coordinate funding. Of course the project is still independent. If there are other funding sources, e.g. donations from individuals to Spyder via OpenCollective, all the better."

Multiple Projects aka Workspaces

In its current state Spyder can only handle one active project at a time. Although in the past we had basic support for workspaces, it was never a fully functional feature, so to ease development and simplify the user experience, we decided to remove it in the 3.x series.

Read more…