Quansight Labs Dask Update

This post provides an update on some recent Dask-related activities the Quansight Labs team has been working on.

Dask community work order

Through a community work order (CWO) with the D. E. Shaw group, the Quansight Labs team has been able to dedicate developer time towards bug fixes and feature requests for Dask. This work has touched several portions of the Dask codebase, but has generally centered on using Dask Arrays with the distributed scheduler.

Read more…

Spyder 4.0 beta4: Kite integration is here

Kite is sponsoring the work discussed in this blog post, and in addition supports Spyder 4.0 development through a Quansight Labs Community Work Order.

As part of our next release, we are proud to announce an additional completion client for Spyder: Kite. Kite is a novel completion client that uses machine learning techniques to find and predict the best autocompletions for a given piece of text. Additionally, it collects improved documentation for compiled packages, e.g., Matplotlib, NumPy, and SciPy, that cannot be obtained easily by using traditional code analysis packages such as Jedi.


Read more…

Quansight presence at SciPy'19

Yesterday the SciPy'19 conference ended. It was a lot of fun, and very productive. You can really feel that there's a lot of energy in the community, and that it's growing and maturing. This post is just a quick update to summarize Quansight's presence and contributions, as well as some of the more interesting things I noticed.

A few highlights

The "Open Source Communities" track, which had a strong emphasis on topics like burnout, diversity, and sustainability, as well as the keynotes by Stuart Geiger ("The Invisible Work of Maintaining and Sustaining Open-Source Software") and Carol Willing ("Jupyter: Always Open for Learning and Discovery"), showed that many more people and projects are paying attention to, and evolving their thinking on, the human and organizational aspects of open source.

I did not go to many technical talks, but did make sure to catch Matt Rocklin's talk "Refactoring the SciPy Ecosystem for Heterogeneous Computing". Matt clearly explained some key issues and opportunities around the state of array computing libraries in Python - I highly recommend watching this talk.

Abigail Cabunoc Mayes' talk "Work Open, Lead Open (#WOLO) for Sustainability" was fascinating - it made me rethink the governance models and roles we use for our projects, and I worked on some of her concrete suggestions during the sprints.

Read more…

Ibis: Python data analysis productivity framework

Ibis is a library for data analysis tasks that provides a pandas-like API. It lets you build up operations such as filters, new columns, and math expressions lazily: each operation is only registered in memory, not executed. When you ask for the result of the expression you created, Ibis compiles it and sends a request to the remote storage and execution system (e.g., Hadoop components or a SQL database). Its goal is to simplify analytical workflows and make you more productive.

Ibis was created by Wes McKinney and is mainly maintained by Phillip Cloud and Krisztián Szűcs. Also, recently, I was invited to become a maintainer of the Ibis repository!

Maybe you are thinking: "Why should I use Ibis?" Well, if you have any of the following issues, you should probably consider using Ibis in your analytical workflow!

  • if you need to get data from a SQL database but you don't know much SQL ...
  • if you build SQL statements manually as strings, with lots of IFs in your code composing specific parts of the query (hard to maintain, and it makes your code pretty ugly) ...
  • if you need to handle large volumes of data ...
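The second point deserves a concrete illustration. Building SQL by hand from strings, with conditionals deciding which fragments to include, is exactly the fragile pattern described above. Here is a toy sketch using Python's built-in sqlite3; the orders table and its columns are made up:

```python
import sqlite3

# Hand-built SQL with conditional string assembly: the fragile pattern
# Ibis expressions are designed to replace. Table/column names are made up.
def build_query(min_amount=None, country=None):
    query = "SELECT country, SUM(amount) FROM orders"
    clauses = []
    if min_amount is not None:
        clauses.append(f"amount >= {min_amount}")
    if country is not None:
        # Note: naive quoting like this is also an injection risk.
        clauses.append(f"country = '{country}'")
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    return query + " GROUP BY country"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (country TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("NL", 10.0), ("NL", 5.0), ("US", 2.0)])
print(build_query(min_amount=4))
# SELECT country, SUM(amount) FROM orders WHERE amount >= 4 GROUP BY country
print(con.execute(build_query(min_amount=4)).fetchall())
# [('NL', 15.0)]
```

With Ibis, the same filters and aggregation are expressed as composable objects that the library compiles to SQL for the target backend, so this query-assembly logic (and its quoting pitfalls) disappears from your code.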

uarray update: API changes, overhead and comparison to __array_function__

uarray is a generic override framework for objects and methods in Python. Since my last uarray blogpost, there have been plenty of developments, changes to the API and improvements to the overhead of the protocol. Let’s begin with a walk-through of the current feature set and API, and then move on to current developments and how it compares to __array_function__. For further details on the API and latest developments, please see the API page for uarray. The examples there are doctested, so they will always be current.

Motivation

Other array objects

NumPy provides a simple, rectangular, dense, in-memory data store. This is great for many applications but isn't complete on its own; it doesn't encompass every single use-case. The following are examples of array objects available today that have different features and cater to different kinds of audiences.

  • Dask is one of the most popular ones. It allows distributed and chunked computation.
  • CuPy is another popular one, and allows GPU computation.
  • PyData/Sparse is slowly gaining popularity, and is a sparse, in-memory data store.
  • XArray includes named dimensions.
  • Xnd is another effort to re-write and modernise the NumPy API, and includes support for GPU arrays and ragged arrays.
  • Another effort (although with no Python wrapper, only data marshalling) is xtensor.

Some of these objects can be composed. Namely, Dask both expects and exports the NumPy API, whereas XArray expects it. This makes interesting combinations possible, such as distributed sparse or GPU arrays, or even labelled distributed sparse or CPU/GPU arrays.

Also, there are many other libraries (a popular one being scikit-learn) that need a back-end mechanism in order to be able to support different kinds of array objects. Finally, there is a desire to see SciPy itself gain support for other array objects.

__array_function__ and its limitations

One of my motivations for working on uarray was the limitations of the __array_function__ protocol, defined in this proposal. The main limitations are:

  • It can only dispatch on array objects.
  • Consequently, it can only dispatch on functions that accept array objects.
  • It has no mechanism for conversion and coercion.
  • Since it conflates arrays and backends, only a single backend type per array object is possible.
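For context, here is the shape of the protocol these bullets refer to: an object opts in by defining `__array_function__`, and NumPy functions dispatch on it (NumPy ≥ 1.17). This is a minimal sketch; the `DiagonalArray` class is made up for illustration:

```python
import numpy as np

class DiagonalArray:
    """Toy duck array: n copies of `value` on the diagonal (made-up example)."""

    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array_function__(self, func, types, args, kwargs):
        # Dispatch only the NumPy functions we explicitly support.
        if func is np.sum:
            return self._n * self._value
        # Signal that other NumPy functions are unsupported for this type.
        return NotImplemented

arr = DiagonalArray(5, 2.0)
print(np.sum(arr))  # 10.0
```

Dispatch hangs off the array object itself, which is exactly why a single array type cannot choose between multiple backends, and why there is no separate conversion or coercion step.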

These limitations have been partially discussed before.

uarray — The solution?

With that out of the way, let's explore uarray, a library that hopes to resolve these issues. Although the original motivation was NumPy and array computing, the library itself is meant to be a generic multiple-dispatch mechanism.

In [1]:
# Enable __array_function__ for NumPy < 1.17.0
# (must be set before NumPy is imported; !export would only affect a subshell)
%env NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1
In [2]:
import uarray as ua
import numpy as np

In uarray, the fundamental building block is a multimethod. Multimethods have a number of nice properties, such as automatic dispatch based on backends. It is important to note here that multimethods will be written by API authors, rather than implementors. Here's how we define a multimethod in uarray:
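The uarray definition itself is in the full post. As a rough conceptual sketch of what backend-based multiple dispatch involves (plain Python, not the uarray API; every name below is hypothetical):

```python
# Conceptual sketch of backend-based dispatch. NOT the uarray API.
class Backend:
    """A backend maps multimethods to concrete implementations."""
    def __init__(self, name):
        self.name = name
        self._impls = {}

    def register(self, multimethod_func):
        def decorator(func):
            self._impls[multimethod_func] = func
            return func
        return decorator

_active_backends = []

def set_backend(backend):
    # Most recently set backend wins.
    _active_backends.insert(0, backend)

def multimethod(func):
    """API authors write only the signature; backends supply implementations."""
    def dispatcher(*args, **kwargs):
        for backend in _active_backends:
            impl = backend._impls.get(dispatcher)
            if impl is not None:
                return impl(*args, **kwargs)
        raise TypeError(f"no backend implements {func.__name__}")
    dispatcher.__name__ = func.__name__
    return dispatcher

@multimethod
def multiply(a, b):
    ...  # signature only; no implementation here

numpy_like = Backend("numpy_like")

@numpy_like.register(multiply)
def _(a, b):
    return a * b

set_backend(numpy_like)
print(multiply(3, 4))  # 12
```

The essential point is the separation of roles: the API author defines `multiply` once, and each backend registers its own implementation.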

Read more…

Labs update and May highlights

Time flies when you're having fun. Here is an update of some of the highlights of my second month at Quansight Labs.

The making of a black hole image & GitHub Sponsors

Both Travis and I were invited by GitHub to attend GitHub Satellite in Berlin. The main reason was that Nat Friedman (GitHub CEO) decided to spend the first 20 minutes of his keynote highlighting the Event Horizon Telescope's black hole image and the open source software that made that imaging possible. Scientific Python featured very prominently: NumPy, Matplotlib, Python, Cython, SciPy, AstroPy and other projects were highlighted. At the same time, Nat introduced new GitHub features like "used by", a triaging role and new dependency graph features, and illustrated how those worked for NumPy. These features will be very welcome news to maintainers of almost any project.

GitHub Satellite'19 keynote, showcasing NumPy and Matplotlib

The single most visible feature introduced was GitHub Sponsors:

GitHub Sponsors enabled on the NumPy repo

I really enjoyed meeting Devon Zuegel, Product Manager of the Open Source Economy Team at GitHub, in person after previously having had the chance to exchange ideas with her about the funding related needs of scientific Python projects and their core teams. I'm confident that GitHub Sponsors will evolve in a direction that's beneficial to community-driven open source projects.

Read more…

TDK-Micronas partners with Quansight to sponsor Spyder

TDK-Micronas is sponsoring Spyder development efforts through Quansight Labs. This will enable the development of some features that have been requested by our users, as well as new features that will help TDK develop custom Spyder plugins to complement their Automatic Test Equipment (ATE) in the development of their Application Specific Integrated Circuits (ASICs).

At this point it may be useful to clarify the role of Quansight Labs in Spyder's development and the nature of the relationship with TDK. To quote Ralf Gommers (director of Quansight Labs):

"We're an R&D lab for open source development of core technologies around data science and scientific computing in Python. And focused on growing communities around those technologies. That's how I see it for Spyder as well: Quansight Labs enables developers to be employed to work on Spyder, and helps with connecting them to developers of other projects in similar situations. Labs should be an enabler to let the Spyder project, its community and individual developers grow. And Labs provides mechanisms to attract and coordinate funding. Of course the project is still independent. If there are other funding sources, e.g. donations from individuals to Spyder via OpenCollective, all the better."

Multiple Projects aka Workspaces

In its current state, Spyder can only handle one active project at a time. Although we had basic support for workspaces in the past, it was never a fully functional feature, so to ease development and simplify the user experience, we decided to remove it in the 3.x series.

Read more…

metadsl: A Framework for Domain Specific Languages in Python

Hello, my name is Saul Shanabrook and for the past year or so I have been at Quansight exploring the array computing ecosystem. This started with working on the xnd project, a set of low level primitives to help build cross platform NumPy-like APIs, and continued with exploring Lenore Mullin's work on a mathematics of arrays. After spending quite a bit of time working on an integrated solution built on these concepts, I decided to step back to try to generalize and simplify the core concepts. The trickiest part was not actually compiling mathematical descriptions of array operations in Python, but figuring out how to make them useful to existing users. To do this, we need to meet users where they are, with the APIs they are already familiar with, like NumPy. The goal of metadsl is to make it easier to tackle parts of this problem separately so that we can collaborate on tackling it together.

Libraries for Scientific Computing

Much of the recent rise in Python's popularity is due to its use for scientific computing and machine learning. This work is built on different frameworks, like Pandas, NumPy, TensorFlow, and scikit-learn. Each of these is meant to be used from Python, but has its own concepts and abstractions to learn on top of the core language, so we can look at them as Domain Specific Languages (DSLs). As the ecosystem has matured, we now demand more flexibility in how these languages are executed. Dask gives us a way to write Pandas or NumPy and execute it across many cores or computers, Ibis allows us to write Pandas-style code against a SQL database, CuPy lets us execute NumPy on our GPU, and Numba optimizes our NumPy expressions on a CPU or GPU. These projects prove that it is possible to write optimizing compilers that target varying hardware paradigms for existing Python numeric APIs. However, this isn't straightforward, and these projects' success is a testament to the perseverance and ingenuity of their authors.

We need to make it easy to add reusable optimizations to libraries like these, so that we can support the latest hardware and compiler optimizations from Python. metadsl is meant to be a place to come together to build a framework for DSLs in Python. It provides a way to separate the user experience from the specifics of execution, enabling consistency and flexibility for users. In this post, I will go through an example of creating a very basic DSL. It will not use the metadsl library, but will be created in the same style as metadsl to illustrate its basic principles.
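To preview that style, here is a tiny expression DSL sketched in plain Python. It does not use metadsl; all names are made up, but it follows the same principle of representing calls as data and separating rewriting from execution:

```python
from dataclasses import dataclass

# A tiny expression DSL in the spirit described above (all names made up).
@dataclass(frozen=True)
class Call:
    name: str
    args: tuple

def add(x, y):
    # The user-facing API builds a tree instead of computing a value.
    return Call("add", (x, y))

def simplify(expr):
    """One rewrite rule: add(x, 0) -> x."""
    if isinstance(expr, Call) and expr.name == "add":
        a, b = (simplify(arg) for arg in expr.args)
        if b == 0:
            return a
        return Call("add", (a, b))
    return expr

def interpret(expr):
    """Execute the expression tree with plain Python numbers."""
    if isinstance(expr, Call) and expr.name == "add":
        return interpret(expr.args[0]) + interpret(expr.args[1])
    return expr

e = add(add(1, 2), 0)
print(simplify(e))   # Call(name='add', args=(1, 2))
print(interpret(e))  # 3
```

The user-facing `add` only builds a tree; `simplify` and `interpret` are separate passes over it, so the same expressions could later be optimized or compiled for different targets.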

Community-driven open source and funded development

Quansight Labs is an experiment for us in a way. One of our main aims is to channel more resources into community-driven PyData projects, to keep them healthy and accelerate their development. And do so in a way that projects themselves stay in charge.

This post explains one method we're starting to use for this. I'm writing it to be transparent with projects, the wider community and potential funders about what we're starting to do. As well as to explicitly solicit feedback on this method.

Community work orders

If you talk to someone about supporting an open source project, in particular a well-known one that they rely on (e.g. NumPy, Jupyter, Pandas), they're often willing to listen and help. What you quickly learn, though, is that they want to know in some detail what will be done with the funds provided. This is true not only for companies, but also for individuals. In addition, companies will likely want a written agreement and some form of reporting on the progress of the work. To meet this need we came up with community work orders (CWOs): agreements that outline what work will be done on a project (implementing new features, release management, improving documentation, etc.) and define a reporting mechanism. What makes a CWO different from a consulting contract? The key differences are:

  1. It must be work that is done on the open source project itself (and not e.g. on a plugin for it, or a customization for the client).
  2. The developers must have a reasonable amount of freedom to decide what to work on and what the technical approach will be, within the broad scope of the agreement.
  3. Deliverables cannot be guaranteed to end up in a project; instead the funder gets the promise of a best effort of implementation and working with the community.

Respecting community processes

Point 3 above is particularly important: we must respect how open source projects make decisions. If the project maintainers decide that they don't want to include a particular change or new feature, that's their decision to make. Any code change proposed as part of work on a CWO has to go through the same review process as any other change, and be accepted on its merits. The argument "but someone paid for this" isn't particularly strong, nor is it one that reviewers should have to care about. Of course, we don't expect rejections to be common. An important part of the Quansight value proposition is that, because we understand how open source works and many of our developers are already maintainers of and contributors to the relevant open source projects, we propose work that the community already has an interest in, and we open the discussion about any major code change early to avoid issues.

Read more…

Measuring API usage for popular numerical and scientific libraries

Developers of open source software often have a difficult time understanding how others use their libraries. Having better data on when and how functions are used has many benefits. Some of these are:

  • better API design
  • determining whether a feature can be deprecated or removed
  • more instructive tutorials
  • understanding the adoption of new features

Python Namespace Inspection

We wrote a general tool, python-api-inspect, to analyze any function/attribute call within a given set of namespaces in a repository. This work was heavily inspired by a blog post on inspecting method usage with Google BigQuery for pandas, NumPy, and SciPy. That earlier work used regular expressions to search for method usage. The primary issue with this approach is that it cannot handle import numpy.random as rand; rand.random(...) unless additional regular expressions are constructed for each case, and it will produce false positives. Additionally, BigQuery is not a free resource. Thus, the approach is not general enough, and it does not scale well with the number of libraries whose function and attribute usage we would like to inspect.

A more robust approach is to inspect the Python abstract syntax tree (AST). Python's ast module provides a performant function, ast.parse(...), for constructing an AST from source code. A node visitor is used to traverse the AST and record import statements and function/attribute calls. This allows us to catch any absolute namespace reference. The following are cases that python-api-inspect catches:
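Those cases are listed in the full post. As a simplified sketch of the visitor approach (not the actual python-api-inspect implementation), resolving aliased imports so that rand.random maps back to numpy.random.random looks like this:

```python
import ast

class APIVisitor(ast.NodeVisitor):
    """Record imported names and attribute accesses on them.

    A simplified sketch of the approach described above, not the
    actual python-api-inspect implementation.
    """
    def __init__(self):
        self.aliases = {}   # local alias -> full module path
        self.calls = []     # resolved attribute references

    def visit_Import(self, node):
        # `import numpy.random as rand` records aliases['rand'] = 'numpy.random'
        for alias in node.names:
            self.aliases[alias.asname or alias.name] = alias.name
        self.generic_visit(node)

    def visit_Attribute(self, node):
        # Resolve e.g. rand.random -> numpy.random.random
        if isinstance(node.value, ast.Name) and node.value.id in self.aliases:
            self.calls.append(f"{self.aliases[node.value.id]}.{node.attr}")
        self.generic_visit(node)

source = "import numpy.random as rand\nx = rand.random(10)"
visitor = APIVisitor()
visitor.visit(ast.parse(source))
print(visitor.calls)  # ['numpy.random.random']
```

This handles exactly the aliased-import case that the regular-expression approach misses, and it never touches text inside strings or comments, so it avoids that class of false positives.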

Read more…