Read Our Blog

Dataframe interchange protocol and Vaex

The work I briefly describe in this blog post is the implementation of the dataframe interchange protocol in Vaex, which I worked on over three months as a Quansight Labs intern.

The dataframe protocol will enable data interchange between different dataframe libraries, for example cuDF, Vaex, Koalas, and Pandas. Of these, Vaex is the library for which I attempted the implementation of the protocol. Vaex is a high-performance Python library for lazy out-of-core DataFrames.

Figure: Connection between dataframe libraries with the dataframe protocol

About | What is all that?

Today there are quite a number of different dataframe libraries available in Python. There are also quite a number of libraries that consume dataframes, plotting libraries for example. In most cases they accept only Pandas dataframes, so users often have to convert between dataframe types just to use the functionality of a specific plotting library. It would be extremely cool to be able to use plotting libraries on any kind of dataframe, would it not?
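To make the idea concrete, here is a minimal sketch of how a consumer could accept any dataframe that exposes the interchange protocol. It uses pandas (version 1.5 or later is assumed) as the consumer, and `ForeignFrame` is a hypothetical stand-in for another dataframe library; it is not Vaex's implementation.

```python
import pandas as pd
from pandas.api.interchange import from_dataframe


class ForeignFrame:
    """Hypothetical stand-in for a non-pandas dataframe library."""

    def __init__(self, data):
        # Assumption: for brevity the data is stored in a pandas DataFrame;
        # a real library would wrap its own memory buffers instead.
        self._data = pd.DataFrame(data)

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # The protocol entry point: return an interchange object.
        return self._data.__dataframe__(allow_copy=allow_copy)


def to_pandas(frame):
    """Convert any object that implements __dataframe__ to pandas."""
    return from_dataframe(frame)


foreign = ForeignFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})
converted = to_pandas(foreign)
print(converted)
```

A plotting library written this way would only need to call something like `to_pandas` on its input, rather than requiring users to convert their data by hand.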

Read more…

Low-code contributions through GitHub

Healthy, inclusive communities are critical to impactful open source projects. A challenge for established projects is that their history and implicit technical debt raise the barrier to contributing to significant portions of the code base. Literacy in a large code base develops over time through incremental contributions, and we'll discuss a format that can help people begin this journey.

At Quansight Labs, we are motivated to provide opportunities for new contributors to experience open source community work regardless of their software literacy. Community workshops are a common format for onboarding, but sometimes the outcome can be less than satisfactory for participants and organizers. In these workshops, participants must overcome implicit challenges, such as using Git or setting up a development environment, before they can contribute to a project's revision history.

Our goal with the following low-code workshop is to offer a way for folks to join a project's contributors list without the technical overhead. To achieve this we'll discuss a format that relies solely on the GitHub web interface.

Read more…

Not a checklist: different accessibility needs in JupyterLab

JupyterLab Accessibility Journey Part 3

In a pandemic, the template joke-starter “x and y walk into a bar” seems like a stretch from my reality. So let’s try this remote version:

Two community members with accessibility knowledge enter a virtual meeting room to talk about JupyterLab. They’ve both updated themselves on GitHub issues ahead of time. They’ve both identified major problems with the interface. They both get ready to express to the rest of the community what is indisputably, one hundred percent for-sure the biggest accessibility blocker in JupyterLab for users. Here it is, the moment of truth!

And they each say totally different things.

Read more…

CZI EOSS4 Grants at Quansight Labs

Here, at Quansight Labs, our goal is to work on sustaining the future of Open Source. We make sure we can live up to that goal by spending a significant amount of time working on impactful and critical infrastructure and projects within the Scientific Ecosystem.

As such, our goals align with those of the Chan Zuckerberg Initiative and, in particular, the Essential Open Source Software for Science (EOSS) program that supports tools essential to biomedical research via funds for software maintenance, growth, development, and community engagement.

The Chan Zuckerberg Initiative was founded in 2015 to help solve some of society’s toughest challenges, from eradicating disease and improving education to addressing the needs of our local communities. Its mission is to build a more inclusive, just, and healthy future for everyone.

Today, we are thrilled to announce that the team at Quansight Labs has been awarded five EOSS Cycle 4 grants to work on several projects within the PyData ecosystem. This post will introduce the successful grantees and their objectives for these two-year grants.

Read more…

Is GitHub Actions suitable for running benchmarks?

Figure: Reliability of benchmarks in GitHub Actions. This 2D plot shows a 16-day time series on the x-axis. Each data point on the x-axis corresponds to a cloud of 75 measurements (one per benchmark test). The y-axis spread of each cloud corresponds to the performance ratio. Ideal measurements would have a performance ratio of 1.0, since both runs returned the exact same performance. In practice this does not happen and we can observe ratios between 0.6 and 1.5. This plot shows that while there is an observable y-spread, it is small enough to be considered sensitive to performance regressions of more than 50%.

Benchmarking software is a tricky business. For robust results, you need dedicated hardware that only runs the benchmarking suite under controlled conditions. No other processes! No OS updates! Nothing else! Even then, you might find out that CPU throttling, thermal regulation and other issues can introduce noise in your measurements.

So, how are we even trying to do it on a CI provider like GitHub Actions? Every job runs in a separate VM instance with frequent updates and shared resources. It looks like it would just be a very expensive random number generator.

Well, it turns out that there is a sensible way to do it: relative benchmarking. And we know it works because we have been collecting stability data points for several weeks.
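As a rough illustration of relative benchmarking, the sketch below times a baseline and a contender in the same process and compares them by ratio instead of by absolute time. The function names, the use of the median, and the 1.5 threshold (taken from the "more than 50%" sensitivity mentioned in the figure caption) are assumptions for this example, not the setup used in the post.

```python
import statistics
import timeit


def measure(func, repeat=5, number=1000):
    """Time `func` several times on the current machine."""
    return timeit.repeat(func, repeat=repeat, number=number)


def performance_ratio(baseline_times, contender_times):
    """Ratio of median timings; 1.0 means identical performance."""
    return statistics.median(contender_times) / statistics.median(baseline_times)


def is_regression(ratio, threshold=1.5):
    """Flag only slowdowns larger than the assumed noise band."""
    return ratio > threshold


# Both variants run back to back on the same VM, so noise from the
# shared hardware affects them in a similar way.
baseline = measure(lambda: sum(range(100)))
contender = measure(lambda: sum(list(range(100))))
ratio = performance_ratio(baseline, contender)
print(f"ratio={ratio:.2f} regression={is_regression(ratio)}")
```

Because only the ratio is recorded, the absolute speed of whichever machine the CI job happens to land on matters much less.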

Read more…

Moving SciPy to the Meson build system

Let's start with an announcement: SciPy now builds with Meson on Linux, and the full test suite passes!

This is a pretty exciting milestone, and good news for SciPy maintainers and contributors: they can look forward to much faster builds and a more pleasant development experience. So how fast is it? Currently the build takes about 1min 50s (a ~4x improvement) on my 3-year-old 12-core Intel CPU (i9-7920X @ 2.90GHz):

Profiling result of a parallel build (12 jobs) of SciPy with Meson. Visualization created with ninjatracing and Perfetto.

As you can see from the tracing results, building a single C++ file (bsr.cxx, which is one of SciPy's sparse matrix formats) takes over 90 seconds. So the 1min 50s build time is close to optimal: the only ways to improve it are major surgery on that C++ code, or buying a faster CPU.

Read more…

Pyflyby: Improving Efficiency of Jupyter Interactive Sessions

Few things hinder productivity more than interruption. A notification, random realization, or unrelated error can derail one's train of thought when deep in a complex analysis – a frustrating experience.

In the software development context, forgetting an import statement in an interactive Jupyter session is such an experience. This can be especially frustrating when using typical abbreviations, like np, pd, plt, where the meaning is obvious to the human reader, but not to the computer. The time-to-first-plot and the ability to quickly clean up one's notebook afterward are critical to an enjoyable and efficient workflow.

In this blog post we present pyflyby, a project and an extension to IPython and JupyterLab that, among many other things, automatically inserts imports and tidies Python files and notebooks.
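To give a feel for what automatic import insertion involves, here is a toy sketch that finds names a snippet uses without defining and looks them up in a table of known abbreviations. This is not pyflyby's implementation or API; `KNOWN_IMPORTS` and `missing_imports` are hypothetical names, and the analysis ignores scoping and statement order for simplicity.

```python
import ast
import builtins

# Hypothetical table mapping common abbreviations to import statements.
KNOWN_IMPORTS = {
    "np": "import numpy as np",
    "pd": "import pandas as pd",
    "plt": "import matplotlib.pyplot as plt",
}


def missing_imports(source, known=KNOWN_IMPORTS):
    """Return, sorted, the import lines needed by names used but never bound."""
    bound = set(dir(builtins))
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                bound.add(node.id)
    return sorted(known[name] for name in used - bound if name in known)


snippet = "x = np.arange(3)\nplt.plot(x)"
for line in missing_imports(snippet):
    print(line)
```

The snippet is only parsed, never executed, so the libraries in the table do not need to be installed for the lookup to work.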

Read more…

Distributed Training Made Easy with PyTorch-Ignite

Authors: François Cokelaer, Priyansi, Sylvain Desroziers, Victor Fomin

Writing distributed code that is agnostic to platforms, hardware configurations (GPUs, TPUs), and communication frameworks is tedious. In this blog post, we will discuss how PyTorch-Ignite solves this problem with minimal code changes.

Read more…