Measuring API usage for popular numerical and scientific libraries

Developers of open source software often have a difficult time understanding how others use their libraries. Having better data on when and how functions are used has many benefits, including:

  • better API design
  • determining whether a feature can be deprecated or removed
  • more instructive tutorials
  • understanding the adoption of new features

Python Namespace Inspection

We wrote a general tool, python-api-inspect, to analyze any function or attribute call within a given set of namespaces in a repository. This work was heavily inspired by a blog post on inspecting method usage with Google BigQuery for pandas, NumPy, and SciPy. That work used regular expressions to search for method usage. The primary issue with regular expressions is that they cannot handle aliased imports such as import numpy.random as rand; rand.random(...) unless an additional expression is constructed for each case, and they can still produce false positives. Additionally, BigQuery is not a free resource. This approach is therefore not general enough and does not scale well with the number of libraries whose function and attribute usage we would like to inspect.
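To make the limitation concrete, here is a minimal sketch. The pattern is a hypothetical one written for this illustration, not the expression used in the cited blog post. It misses the aliased call and matches a call that only appears in a comment.

```python
import re

# Hypothetical pattern in the spirit of a regex-based search
# (an assumption for illustration, not taken from the cited post).
CALL_PATTERN = re.compile(r"\b(?:numpy|np)\.([\w.]+)\(")

source = '''
import numpy.random as rand
rand.random((2, 3))            # real call, missed: the alias hides the namespace
# old code: np.matrix([[1, 2]])
'''

# The only hit comes from the commented-out line, which is a false positive.
matches = CALL_PATTERN.findall(source)
print(matches)  # ['matrix']
```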

A more robust approach is to inspect the Python abstract syntax tree (AST). Python's ast module provides a performant function, ast.parse(...), for constructing an AST from source code. A node visitor then traverses the AST and records import statements and function/attribute calls. This allows us to catch any absolute namespace reference. The following are cases that python-api-inspect catches:

import numpy
import numpy as np
import numpy.random as rnd
from numpy import random as rand

numpy.array([1, 2, 3])
numpy.random.random((2, 3))
np.array([1, 2, 3])
rnd.random((2, 3))
rand.random((2, 3))
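The idea can be sketched with the standard library alone. The class and function names below (NamespaceVisitor, count_usage) are hypothetical and this is not the actual python-api-inspect implementation. It is a simplified visitor that records import aliases and then resolves dotted attribute chains back to absolute names.

```python
import ast
from collections import Counter


class NamespaceVisitor(ast.NodeVisitor):
    """Count dotted references that resolve into one top-level namespace."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.aliases = {}        # local name -> absolute dotted path
        self.counts = Counter()  # absolute dotted path -> occurrences

    def _in_namespace(self, path):
        return path == self.namespace or path.startswith(self.namespace + ".")

    def visit_Import(self, node):
        for alias in node.names:
            if not self._in_namespace(alias.name):
                continue
            if alias.asname:
                # import numpy.random as rnd  ->  rnd means numpy.random
                self.aliases[alias.asname] = alias.name
            else:
                # import numpy.random binds only the top-level name "numpy"
                top = alias.name.split(".")[0]
                self.aliases[top] = top

    def visit_ImportFrom(self, node):
        # Relative imports (level > 0) are skipped in this sketch.
        if node.level or not node.module or not self._in_namespace(node.module):
            return
        for alias in node.names:
            local_name = alias.asname or alias.name
            self.aliases[local_name] = f"{node.module}.{alias.name}"

    def visit_Attribute(self, node):
        # Walk a chain such as rnd.random down to its base name.
        parts = []
        current = node
        while isinstance(current, ast.Attribute):
            parts.append(current.attr)
            current = current.value
        if isinstance(current, ast.Name) and current.id in self.aliases:
            parts.append(self.aliases[current.id])
            self.counts[".".join(reversed(parts))] += 1
        else:
            # Not a known alias: keep looking inside, e.g. np.array(...).T
            self.generic_visit(node)


def count_usage(source, namespace):
    """Parse source and return a Counter of absolute names used from namespace."""
    visitor = NamespaceVisitor(namespace)
    visitor.visit(ast.parse(source))
    return visitor.counts
```

Run on the import and call examples above with the namespace "numpy", this sketch counts numpy.array twice and numpy.random.random three times. It deliberately ignores bare names such as array(...) after from numpy import array, as well as scoping, both of which a complete tool would need to handle.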

There are limitations to this approach since Python is a heavily duck-typed language. To see why, consider the following two examples.

def foobar(array):
    return array.transpose()

a = numpy.array(...)

a.transpose()
foobar(a)

How is one supposed to infer that a.transpose() is a numpy.ndarray method, or that foobar is a function that takes a numpy.ndarray as input? These are open questions; answering them would allow deeper inspection of how libraries use given functions and attributes. It should be noted that dynamically typed languages in general have this problem.

Now that the internals of the tool have been discussed, its usage is quite simple. The repository Quansight-Labs/python-api-inspect comes with two command line tools (Python scripts). The main tool, inspect_api.py, heavily caches downloaded repositories and the source files that have already been analyzed, so inspecting a file a second time is just a sqlite3 lookup. Currently, this repository inspects 17 libraries/namespaces and around 10,000 repositories (100 GB compressed). It has been designed to have no dependencies other than the Python stdlib and to run easily from the command line. Below is the command that is run when inspecting all the libraries that depend on numpy.

python inspect_api.py data/numpy-whitelist.ini \
  --exclude-dirs test,tests,site-packages \
  --extensions ipynb,py \
  --output data/inspect.sqlite

The command comes with several options that can be useful for filtering the results. --exclude-dirs excludes directories within a repository from the counts (e.g. a tests or site-packages directory). This option makes it possible to separate the use of a given namespace in tests from its use within the library itself. --extensions defaults to Python files (*.py) but can also include Jupyter notebooks (*.ipynb), showing us how users use a namespace in an interactive context. Unsurprisingly, this work found that many Jupyter notebooks in repositories have syntax errors.
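As a rough illustration of what such filtering involves, the sketch below walks a repository, prunes excluded directories, and extracts the code cells of a notebook before parsing. The function names are hypothetical and this is not the code in inspect_api.py. It assumes notebooks follow the nbformat 4 JSON layout, with a top-level "cells" list.

```python
import ast
import json
import os


def iter_source_files(root, extensions=("py",), exclude_dirs=()):
    """Yield paths under root with a wanted extension, skipping excluded directories."""
    suffixes = tuple("." + ext for ext in extensions)
    excluded = set(exclude_dirs)
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for filename in filenames:
            if filename.endswith(suffixes):
                yield os.path.join(dirpath, filename)


def notebook_source(text):
    """Join the code cells of a notebook (nbformat 4 JSON) into one source string."""
    notebook = json.loads(text)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # "source" may be a single string or a list of lines.
            chunks.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(chunks)


def try_parse(source):
    """Return an AST, or None when the source is not valid Python."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None
```

One likely contributor to notebook syntax errors is IPython-specific syntax: a cell containing a magic such as %matplotlib inline is not valid Python, so try_parse returns None for it unless such lines are stripped first.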

While not the focus of this post, the repository provides an additional script, dependant-packages.py. This script is used to populate the data/numpy-whitelist.ini file with repositories that depend on numpy. This would not be possible without the libraries.io API. It is a remarkable project which deserves more attention.
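Returning to the caching in inspect_api.py mentioned earlier, the following is a minimal sketch of how a sqlite3-backed cache of per-file results could work. The table name, schema, and function names are assumptions made for this illustration; the real tool's storage layout may differ.

```python
import hashlib
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_cache ("
        "sha256 TEXT PRIMARY KEY, counts TEXT NOT NULL)"
    )
    return conn


def cached_analysis(conn, source, analyze):
    """Return analyze(source), reusing a stored result when the same text was seen before."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT counts FROM file_cache WHERE sha256 = ?", (key,)
    ).fetchone()
    if row is not None:
        # Cache hit: no parsing needed, just decode the stored JSON.
        return json.loads(row[0])
    result = analyze(source)
    conn.execute(
        "INSERT INTO file_cache (sha256, counts) VALUES (?, ?)",
        (key, json.dumps(result)),
    )
    conn.commit()
    return result
```

Keying on a hash of the file contents, rather than on the path, means identical files vendored into many repositories are analyzed only once. The analyze callable must return something JSON-serializable, such as a plain dict of name counts.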

Results

The table below summarizes the findings of namespace usage within all *.py files, *.py files only in test directories (tests, test), all *.py files excluding those within test directories, and only Jupyter notebook *.ipynb files. All of the results are provided as csv files. It is important to note that the inspect_api.py script gathers much more detail than is included in the csv files, and there is plenty of additional work that could be done with this tool for general Python AST analysis.

Library | Whitelist | Summary only `.py` | Summary only `.py` tests | Summary only `.py` without tests | Summary only `.ipynb`
astropy | ini | csv | csv | csv | csv
dask | ini | csv | csv | csv | csv
ipython | ini | csv | csv | csv | csv
ipywidgets | ini | csv | csv | csv | csv
matplotlib | ini | csv | csv | csv | csv
numpy | ini | csv | csv | csv | csv
pandas | ini | csv | csv | csv | csv
pyarrow | ini | csv | csv | csv | csv
pymapd | ini | csv | csv | csv | csv
pymc3 | ini | csv | csv | csv | csv
pytorch | ini | csv | csv | csv | csv
requests | ini | csv | csv | csv | csv
scikit-image | ini | csv | csv | csv | csv
scikit-learn | ini | csv | csv | csv | csv
scipy | ini | csv | csv | csv | csv
statsmodels | ini | csv | csv | csv | csv
sympy | ini | csv | csv | csv | csv
tensorflow | ini | csv | csv | csv | csv

Since many namespaces were checked, we will highlight only some of the results. First, for NumPy, the most used function calls are unsurprising: numpy.array, numpy.zeros, numpy.asarray, numpy.arange, numpy.sqrt, numpy.sum, and numpy.dot. There are plans to deprecate numpy.matrix, and this seems possible since numpy.matrix is not among the top 150 function calls. The most used NumPy testing functions were the expected testing.assert_allclose, testing.assert_almost_equal, and testing.assert_equal.

SciPy acts as a glue for many algorithms needed for scientific and numerical work. The usage of scipy is surprising, and the results for it are possibly the most accurate in this analysis. This is because scipy tends to consist of function wrappers over lower-level routines, with fewer class instance methods, which are harder to detect as discussed above. The sparse methods are heavily used, along with several high-level wrappers such as scipy.interpolate.interp1d and scipy.optimize.minimize. I was surprised to find out that one of my favorite SciPy functions, scipy.signal.find_peaks, is rarely used! Only a small fraction of the scipy.signal functions are used, and these include scipy.signal.lfilter, scipy.signal.fftconvolve, scipy.signal.convolve2d, scipy.signal.lti, and scipy.signal.savgol_filter.

scikit-learn is a popular library for data analysis and offers many of the traditional machine learning algorithms. Here we rank the most used models.

  1. sklearn.linear_model.LogisticRegression
  2. sklearn.decomposition.PCA
  3. sklearn.ensemble.RandomForestClassifier
  4. sklearn.cluster.KMeans
  5. sklearn.svm.SVC

pandas is another popular data analysis library for tabular data that helped drive the popularity of Python. One of the huge benefits of pandas is that it allows reading many file formats into a single in-memory pandas.DataFrame object. Unsurprisingly, the most popular pandas functions are pandas.DataFrame and pandas.Series. Here we rank the most popular pandas.read_* functions.

  1. pandas.read_csv
  2. pandas.read_table
  3. pandas.read_sql_query
  4. pandas.read_json
  5. pandas.read_pickle
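A ranking like the one above can be derived from per-name counts by filtering on a prefix. The sketch below is one way to do so; rank_by_prefix is a hypothetical helper, and the counts are made-up numbers for illustration only, not results from this study.

```python
from collections import Counter


def rank_by_prefix(counts, prefix, n=5):
    """Return the n most frequent names that start with prefix."""
    subset = Counter(
        {name: count for name, count in counts.items() if name.startswith(prefix)}
    )
    return [name for name, _ in subset.most_common(n)]


# Made-up counts purely for illustration; they are NOT results from this study.
toy_counts = Counter({
    "pandas.DataFrame": 50,
    "pandas.read_csv": 30,
    "pandas.read_json": 4,
    "pandas.read_table": 9,
    "pandas.Series": 40,
})

print(rank_by_prefix(toy_counts, "pandas.read_"))
# ['pandas.read_csv', 'pandas.read_table', 'pandas.read_json']
```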

requests makes working with HTTP easier than the stdlib urllib.request and is one of the most downloaded packages. Looking at the usage data for requests, three functions are primarily used (everything else is used 3-5x less): requests.get, requests.post, and requests.Session, with headers being the most common argument.

Overall it is clear that libraries are used differently within packages, tests, and notebooks. Notebooks tend to prefer high-level routines such as scipy.optimize.minimize, numpy.linspace, and matplotlib.pyplot.plot, which can be used for demos. Notebook function usage would therefore be a good metric for choosing material worth including in introductory and quick-start documentation. Likewise, testing and development documentation could be informed by which functions are used in tests and in packages. Further work is necessary to generalize this tool, as analytics of this kind could help the Python ecosystem better understand how the language is being used.
