metadsl PyData talk
PyData NYC just ended and I thought it would be good to collect my thoughts on
metadsl based on the many conversations I had there surrounding it. This is a rather long post, so if you are just looking for some code here is a Binder link for my talk. Also, here is the talk I gave a month or so later on the same topic in Austin:
class Number(metadsl.Expression): @metadsl.expression def __add__(self, other: Number) -> Number: ... @metadsl.expression @classmethod def from_int(cls, i: int) -> Number: ... @metadsl.rule def add_zero(y: Number): yield Number.from_int(0) + y, y yield y + Number.from_int(0), y
Spyder 4 will be released very soon with lots of interesting new features that you'll want to check out, reflecting years of effort by the team to improve the user experience. In this post, we will be talking about the improvements made to the Variable Explorer.
These include the brand new Object Explorer for inspecting arbitrary Python variables, full support for MultiIndex dataframes with multiple dimensions, and the ability to filter and search for variables by name and type, and much more.
It is important to mention that several of the above improvements were made possible through integrating the work of two other projects. Code from gtabview was used to implement the multi-dimensional Pandas indexes, while objbrowser was the foundation of the new Object Explorer.
I'm very pleased to announce that NumPy and OpenBLAS just received a $195,000 grant from the Chan Zuckerberg Initiative, through its Essential Open Source Software for Science (EOSS) program! This is good news for both projects, and I'm particularly excited about the types of activities we'll be undertaking, what this will mean in terms of growing the community, and to be part of the first round of funded projects of this visionary program.
The press release gives a high level overview of the program, and the grantee website lists the 32 successful applications. Other projects that got funded include SciPy and Matplotlib (it's the very first significant funding for both projects!), Pandas, Zarr, scikit-image, JupyterHub, and Bioconda - we're in good company!
Nicholas Sofroniew and Dario Taborelli, two of the people driving the EOSS program, wrote a blog post that's well worth reading about the motivations for starting this program and the 42 projects that applied and got funded: The Invisible Foundations of Biomedicine.
Version 4.0 of Spyder—a powerful Python IDE designed for scientists, engineers and data analysts—is almost ready! It has been in the making for well over two years, and it contains lots of interesting new features. We will focus on the Files pane in this post, where we've made several improvements to the interface and file management tools.
In order to simplify the Files pane's interface, the columns corresponding to size and kind are hidden by default. To change which columns are shown, use the top-right pane menu or right-click the header directly.
There comes a time in every project where most technological hurdles have been surpassed, and its adoption is a social problem. I believe
unumpy had reached such a state, a month ago.
I then proceeded, along with Ralf Gommers and Peter Bell to write NumPy Enhancement Proposal 31 or NEP-31. This generated a lot of excellent feedback on the structure and the nuances of the proposal, which you can read both on the pull request and on the mailing list discussion, which led to a lot of restructuring in the contents and the structure of the NEP, but very little in the actual proposal. I take full responsibility for this: I have a bad tendency to assume everyone knows what I'm thinking. Thankfully, I'm not alone in this: It's a known psychological phenomenon.
As of November, 2018, I have been working at Quansight. Quansight is a new startup founded by the same people who started Anaconda, which aims to connect companies and open source communities, and offers consulting, training, support and mentoring services. I work under the heading of Quansight Labs. Quansight Labs is a public-benefit division of Quansight. It provides a home for a "PyData Core Team" which consists of developers, community managers, designers, and documentation writers who build open-source technology and grow open-source communities around all aspects of the AI and Data Science workflow.
My work at Quansight is split between doing open source consulting for various companies, and working on SymPy. SymPy, for those who do not know, is a symbolic mathematics library written in pure Python. I am the lead maintainer of SymPy.
In this post, I will detail some of the open source work that I have done recently, both as part of my open source consulting, and as part of my work on SymPy for Quansight Labs.
Bounds Checking in Numba
As part of work on a client project, I have been working on contributing code
to the numba project. Numba is a just-in-time
compiler for Python. It lets you write native Python code and with the use of
@jit decorator, the code will be automatically sped up using LLVM.
This can result in code that is up to 1000x faster in some cases:
Table of Contents
- Conclusion and Future Work
Lack of stable and reliable scientific computing software has been a persistent problem for the Ruby community, making it hard for enthusiastic Ruby developers to use Ruby in everything from their web applications to their data analysis projects. One of the most important components of any successful scientific software stack is a well maintained and flexible array computation library that can act as a fast and simple way of storing in-memory data and interfacing it with various fast and battle-tested libraries like LAPACK and BLAS.
Various projects have attempted to make such libraries in the past (and some are still thriving and maintained). Some of the notable ones are numo, nmatrix, and more recently, numruby. These projects attempt to provide a simple Ruby-like API for creating and manipulating arrays of various types. All of them are able to easily interface with libraries like ATLAS, FFTW and LAPACK.
However, all of the above projects fall short in two major aspects:
- Lack of extensibility to adapt to modern use cases (read Machine Learning).
- Lack of a critical mass of developers to maintain a robust and fast array library.
The first problem is mainly due to the fact that they do not support very robust type systems. The available data types are limited and are hard to extend to more complex uses. Modern use cases like Machine Learning require a more robust type system (i.e. defining array shapes of arbitrary dimension on multiple devices), as has been demonstrated by the tensor implementations of various frameworks like Tensorflow and PyTorch.
The second problem is due to the fact that all of the aforementioned projects are community efforts that are maintained part-time by developers simply out of a sense of purpose and passion. Sustaining such complex projects for extended periods of time without expectation of any support is simply unfeasible even for the most driven engineers.
This is where the XND project comes in. The XND project is a project for building a common library that is able to meet the needs of the various data analysis and machine learning frameworks that have had to build their own array objects and programming languages. It is built with the premise of extending arrays with new types and various device types (CPUs, GPUs etc.) without loss of performance and ease of use.
This post provides an update on some recent Dask-related activities the Quansight Labs team has been working on.
Dask community work order
Through a community work order (CWO) with the D. E. Shaw group, the Quansight Labs team has been able to dedicate developer time towards bug fixes and feature requests for Dask. This work has touched on several portions of the Dask codebase, but generally have centered around using Dask Arrays with the distributed scheduler.