Developers of open source software often have a difficult time understanding how others use their libraries. Better data on when and how functions are being used has many benefits, including:

better API design
determining whether or not a feature can be deprecated or removed
more instructive tutorials
understanding the adoption of new features

Python Namespace Inspection

We wrote a general tool, python-api-inspect, to analyze any function/attribute call within a given set of namespaces in a repository. This work was heavily inspired by a blog post on inspecting method usage with Google BigQuery for pandas, NumPy, and SciPy. That work used regular expressions to search for method usage. The primary issue with this approach is that it cannot handle aliased imports such as import numpy.random as rand; rand.random(...) unless additional regular expressions are constructed for each case, and it will still produce false positives. Additionally, BigQuery is not a free resource. Thus, this approach is not general enough and does not scale well with the number of libraries whose function and attribute usage we would like to inspect.
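
To make the regular expression problem concrete, here is a minimal sketch. The sample source and the pattern are made up for illustration and are not taken from the earlier blog post. The pattern misses the real call made through an alias, and it matches text inside a comment.

```python
import re

# Hypothetical sample file: one real call through an alias, plus a comment.
source = '''
import numpy.random as rand
x = rand.random((2, 3))
# numpy.random.random is mentioned here only in a comment
'''

# A naive pattern for the fully qualified name.
pattern = re.compile(r"numpy\.random\.random")
matches = pattern.findall(source)

# The only hit is the comment (a false positive); the aliased call is missed.
print(matches)
```

Covering the alias would need a second pattern tied to the name rand, and that name can differ in every file.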

A more robust approach is to inspect the Python abstract syntax tree (AST). Python's ast module provides a performant function, ast.parse(...), for constructing a Python AST from source code. A node visitor is used to traverse the AST and record import statements and function/attribute calls. This allows us to catch any absolute namespace reference. The following are cases that python-api-inspect catches:

import numpy
import numpy as np
import numpy.random as rnd
from numpy import random as rand

numpy.array([1, 2, 3])
numpy.random.random((2, 3))
np.array([1, 2, 3])
rnd.random((2, 3))
rand.random((2, 3))
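
The idea can be sketched with the standard library alone. The code below is not the implementation in python-api-inspect; the class name NamespaceVisitor and its attributes are hypothetical, and it assumes all imports live at module level and ignores relative imports. It records import aliases, then resolves each call back to an absolute dotted name.

```python
import ast
from collections import Counter


class NamespaceVisitor(ast.NodeVisitor):
    """Count calls that resolve to an absolute name inside `namespace`."""

    def __init__(self, namespace):
        self.namespace = namespace
        self.aliases = {}  # local name -> absolute dotted path
        self.calls = Counter()

    def visit_Import(self, node):
        for alias in node.names:
            if alias.asname:
                # "import numpy.random as rnd" binds rnd to numpy.random
                self.aliases[alias.asname] = alias.name
            else:
                # "import numpy.random" binds only the top-level name numpy
                top = alias.name.split(".")[0]
                self.aliases[top] = top

    def visit_ImportFrom(self, node):
        if node.module is None or node.level:
            return  # relative imports are skipped in this sketch
        for alias in node.names:
            local = alias.asname or alias.name
            self.aliases[local] = f"{node.module}.{alias.name}"

    def _resolve(self, node):
        """Turn an attribute chain such as rnd.random into a dotted name."""
        parts = []
        while isinstance(node, ast.Attribute):
            parts.append(node.attr)
            node = node.value
        if not isinstance(node, ast.Name) or node.id not in self.aliases:
            return None  # e.g. a.transpose(), where a is not an import
        parts.append(self.aliases[node.id])
        return ".".join(reversed(parts))

    def visit_Call(self, node):
        name = self._resolve(node.func)
        prefix = self.namespace + "."
        if name and (name == self.namespace or name.startswith(prefix)):
            self.calls[name] += 1
        self.generic_visit(node)  # also visit calls nested in arguments


source = """
import numpy
import numpy as np
import numpy.random as rnd
from numpy import random as rand

numpy.array([1, 2, 3])
numpy.random.random((2, 3))
np.array([1, 2, 3])
rnd.random((2, 3))
rand.random((2, 3))
"""

visitor = NamespaceVisitor("numpy")
visitor.visit(ast.parse(source))
print(visitor.calls)
```

On the cases listed above, this counts numpy.array twice and numpy.random.random three times, since every alias resolves to the same absolute name.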

There are limitations to this approach since Python is a heavily duck-typed language. To understand this see the following two examples.

def foobar(array):
    return array.transpose()

a = numpy.array(...)

a.transpose()
foobar(a)

How is one supposed to infer that a.transpose() is a numpy.ndarray method, or that foobar is a function that takes a numpy.ndarray as input? These are open questions, and answering them would allow for further inspection of how libraries use given functions and attributes. It should be noted that dynamically typed languages in general have this problem.

Now that the internals of the tool have been discussed, the usage is quite simple. The repository Quansight-Labs/python-api-inspect comes with two command line tools (Python scripts). The important tool, inspect_api.py, heavily caches downloaded repositories and source files that have already been analyzed, so inspecting a file a second time is a sqlite3 lookup. Currently, this repository inspects 17 libraries/namespaces and around 10,000 repositories (100 GB compressed). It has been designed to have no dependencies other than the Python stdlib and to run easily from the command line. Below is the command that is run when inspecting all the libraries that depend on numpy.

python inspect_api.py data/numpy-whitelist.ini \
  --exclude-dirs test,tests,site-packages \
  --extensions ipynb,py \
  --output data/inspect.sqlite
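
As an aside on the caching mentioned above, one way to make a repeat inspection a sqlite3 lookup is to key results on a hash of the file contents. This is only a sketch of that idea: the table name file_cache, its columns, and the stand-in count_calls analysis are assumptions for illustration and do not reflect the actual schema used by inspect_api.py.

```python
import ast
import hashlib
import json
import sqlite3


def count_calls(source):
    """Stand-in analysis: count the call nodes in a piece of source code."""
    tree = ast.parse(source)
    return {"calls": sum(isinstance(n, ast.Call) for n in ast.walk(tree))}


def cached_inspect(connection, source):
    """Return (result, was_cached), keyed on a hash of the file contents."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS file_cache "
        "(sha256 TEXT PRIMARY KEY, result TEXT)"
    )
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    row = connection.execute(
        "SELECT result FROM file_cache WHERE sha256 = ?", (digest,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0]), True  # cache hit: no parsing needed
    result = count_calls(source)
    connection.execute(
        "INSERT INTO file_cache (sha256, result) VALUES (?, ?)",
        (digest, json.dumps(result)),
    )
    connection.commit()
    return result, False


connection = sqlite3.connect(":memory:")  # a file path would persist the cache
first = cached_inspect(connection, "print(len('abc'))")
second = cached_inspect(connection, "print(len('abc'))")
print(first, second)
```

Hashing the contents rather than the path means identical files vendored into many repositories are only analyzed once.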

The command comes with several options that can be useful for filtering the results. --exclude-dirs is used to exclude directories from counts (e.g. a tests directory or site-packages directory) within a repository. This option makes it possible to distinguish the use of a given namespace in tests from its use within the library. --extensions defaults to all Python files (*.py) but can also include Jupyter notebooks (*.ipynb), showing us how users use a namespace in an interactive context. Unsurprisingly, this work found that many Jupyter notebooks in repositories have syntax errors.
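
The filtering and notebook handling described above might look roughly like the following. This is a sketch under stated assumptions, not the code in inspect_api.py: the function names are hypothetical, and notebooks are assumed to be nbformat 4 JSON with a top-level cells list. One likely source of notebook syntax errors is IPython magics such as %matplotlib inline, which are not valid Python.

```python
import ast
import json
import os


def iter_source_files(root, extensions=("py",), exclude_dirs=()):
    """Yield paths under root with a wanted extension, skipping excluded dirs."""
    excluded = set(exclude_dirs)
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for filename in filenames:
            extension = os.path.splitext(filename)[1].lstrip(".")
            if extension in extensions:
                yield os.path.join(dirpath, filename)


def notebook_source(text):
    """Join the code cells of a notebook (nbformat 4 JSON) into one string."""
    notebook = json.loads(text)
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # "source" may be a single string or a list of lines
            cells.append(source if isinstance(source, str) else "".join(source))
    return "\n".join(cells)


def try_parse(source):
    """Return an AST, or None when the source is not valid Python."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None


# A tiny hand-written notebook whose first code cell uses an IPython magic.
notebook_text = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code", "source": ["%matplotlib inline\n"]},
        {"cell_type": "code", "source": ["import numpy as np\n", "np.arange(3)"]},
    ]
})
parsed = try_parse(notebook_source(notebook_text))
print(parsed is None)
```

Parsing each cell separately instead of the joined source would keep one bad cell from discarding the rest of the notebook; whether the original tool does so is not stated here.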

While not the focus of this post, an additional script, dependant-packages.py, is provided in the repository. This script is used to populate the data/numpy-whitelist.ini file with repositories that depend on numpy. This would not be possible without the libraries.io API. It is a remarkable project which deserves more attention.

Results

The table below summarizes the findings of namespace usage within all *.py files, *.py files in test directories only, *.py files excluding those within test directories (tests, test), and Jupyter notebook *.ipynb files only. All of the results are provided as csv files. It is important to note that the inspect_api.py script collects much more detail than is included in the csv files, and there is plenty of additional work that could be done with this tool for general Python AST analysis.

Library	Whitelist	Summary only `.py`	Summary only `.py` tests	Summary only `.py` without tests	Summary only `.ipynb`
astropy	ini	csv	csv	csv	csv
dask	ini	csv	csv	csv	csv
ipython	ini	csv	csv	csv	csv
ipywidgets	ini	csv	csv	csv	csv
matplotlib	ini	csv	csv	csv	csv
numpy	ini	csv	csv	csv	csv
pandas	ini	csv	csv	csv	csv
pyarrow	ini	csv	csv	csv	csv
pymapd	ini	csv	csv	csv	csv
pymc3	ini	csv	csv	csv	csv
pytorch	ini	csv	csv	csv	csv
requests	ini	csv	csv	csv	csv
scikit-image	ini	csv	csv	csv	csv
scikit-learn	ini	csv	csv	csv	csv
scipy	ini	csv	csv	csv	csv
statsmodels	ini	csv	csv	csv	csv
sympy	ini	csv	csv	csv	csv
tensorflow	ini	csv	csv	csv	csv

Since many namespaces were checked, we will highlight only some of the results. First, for NumPy, the most common function calls are unsurprising: numpy.array, numpy.zeros, numpy.asarray, numpy.arange, numpy.sqrt, numpy.sum, and numpy.dot. There are plans to deprecate numpy.matrix, and this seems possible since numpy.matrix is not in the top 150 function calls. The NumPy testing functions were the expected testing.assert_allclose, testing.assert_almost_equal, and testing.assert_equal.

SciPy acts as a glue for many algorithms needed for scientific and numerical work. The usage of scipy is surprising, and its results are possibly the most accurate in this analysis. This is because scipy tends to consist of function wrappers over lower-level routines, with fewer class instance methods, which are harder to detect as discussed above. The sparse methods are heavily used, along with several high-level wrappers such as scipy.interpolate.interp1d and scipy.optimize.minimize. I was surprised to find out that one of my favorite SciPy methods, scipy.signal.find_peaks, is rarely used! Only a small fraction of the scipy.signal functions are used, and these include: scipy.signal.lfilter, scipy.signal.fftconvolve, scipy.signal.convolve2d, scipy.signal.lti, and scipy.signal.savgol_filter.

scikit-learn is a popular library for data analysis and offers some of the traditional machine learning algorithms. Here we rank the most used models.

sklearn.linear_model.LogisticRegression
sklearn.decomposition.PCA
sklearn.ensemble.RandomForestClassifier
sklearn.cluster.KMeans
sklearn.svm.SVC

pandas is another popular data analysis library for tabular data that helped drive the popularity of Python. One of the huge benefits of pandas is that it allows reading many file formats into a single in-memory pandas.DataFrame object. Unsurprisingly, the most popular pandas functions are pandas.DataFrame and pandas.Series. Here we rank the most popular pandas.read_* functions.

pandas.read_csv
pandas.read_table
pandas.read_sql_query
pandas.read_json
pandas.read_pickle

requests makes HTTP requests easier to work with than the stdlib urllib.request and is one of the most downloaded packages. Looking at the usage data for requests, three functions are primarily used (everything else is used 3-5x less): requests.get, requests.post, and requests.Session, with headers being the most common argument.

Overall, it is clear that libraries are used differently within packages, tests, and notebooks. Notebooks tend to prefer high-level routines such as scipy.optimize.minimize, numpy.linspace, and matplotlib.pyplot.plot, which can be used for demos. Additionally, notebook function usage would be a good metric for material that is worthwhile to include in introduction and quick-start documentation. The same goes for testing and development documentation, which can likewise be informed by which functions are used in tests and in packages. Further work is necessary to generalize this tool, as it could help the Python ecosystem better understand, through analytics, how the language is being used.
