Ibis: Python data analysis productivity framework

Ibis is a library pretty useful on data analysis tasks that provides a pandas-like API that allows operations like create filter, add columns, apply math operations etc in a lazy mode so all the operations are just registered in memory but not executed and when you want to get the result of the expression you created, Ibis compiles that and makes a request to the remote server (remote storage and execution systems like Hadoop components or SQL databases). Its goal is to simplify analytical workflows and make you more productive.

Ibis was created by Wes McKinney and is mainly maintained by Phillip Cloud and Krisztián Szűcs. Also, recently, I was invited to become a maintainer of the Ibis repository!

Maybe you are thinking: "why should I use Ibis?". Well, if you have any of the following issues, probably you should consider using Ibis in your analytical workflow!

  • if you need to get data from a SQL database but you don't know much about SQL ...
  • if you create SQL statements manually using string and have a lot of IF's in your code that compose specific parts of your SQL code (it could be pretty hard to maintain and it will makes your code pretty ugly) ...
  • if you need to handle data with a big volume ...

If you want to learn more about ibis consider taking a look at these tutorials:

Do you want to watch some interesting video about Ibis? Check this out:

Now, let's check out some work developed here at Quansight in the last months!

During the last months OmniSci and Quansight were working together to add a backend to Ibis for OmniSciDB (formerly MapD Core)! In a few words, OmniSciDB is an in-memory, column store, SQL relational database designed from the ground up to run on GPUs. If you don't know yet this amazing database, I invite you to check it out.

The implementation of this new backend also resulted in the creation of new expressions/operators on Ibis core, such as:

  • GeoSpatial data types and operations
  • Trigonometric operations
  • Some statistcal operations

First, let's connect to a OmniSciDB and play with this new features!

In [1]:
# install the dependencies if you need!
# !conda install -y ibis-framework=1.1.0 pyarrow pymapd vega geopandas geoalchemy2 shapely matplotlib --force-reinstall
In [2]:
import ibis
from matplotlib import pyplot as plt

print('ibis:', ibis.__version__)
ibis: 1.2.0+7.g3afa8b0d
In [3]:
# metis.mapd.com is used in some OmniSci docs
# but maybe you want to install your own OmniSciDB instance
# you can take a look into installation section at 
# https://www.omnisci.com/docs/latest/
# also you maybe want to check the omniscidb-cpu conda package
# conda install -c conda-forge omniscidb-cpu
# if you need any help, feel free to open an issue at
# https://github.com/conda-forge/omniscidb-cpu-feedstock/
omniscidb_cli = ibis.mapd.connect(
    host='metis.mapd.com', 
    user='mapd', 
    password='HyperInteractive',
    port=443, 
    database='mapd',
    protocol='https'
)

GeoSpatial features

You need to handle geospatial data in a esier way?

Let's take a look inside zipcodes_2017 table!

Well, currently omniscidb backend doesn't support geopandas output, so let's use a workaround for that! It should be implemented into omniscidb backend soon! (see: gist-code)

In [4]:
gist_url = 'https://gist.githubusercontent.com/xmnlab/587dd1bde44850f3117a1087ed3f0f28/raw/0750400db90cf97319a91aa514648c31ad4ace45/omniscidb_geopandas_output.py'
!wget {gist_url} -O omniscidb_geopandas_output.py
--2019-07-05 11:31:57--  https://gist.githubusercontent.com/xmnlab/587dd1bde44850f3117a1087ed3f0f28/raw/0750400db90cf97319a91aa514648c31ad4ace45/omniscidb_geopandas_output.py
Resolviendo gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.48.133
Conectando con gist.githubusercontent.com (gist.githubusercontent.com)[151.101.48.133]:443... conectado.
Petición HTTP enviada, esperando respuesta... 200 OK
Longitud: 1874 (1,8K) [text/plain]
Guardando como: “omniscidb_geopandas_output.py”

omniscidb_geopandas 100%[===================>]   1,83K  --.-KB/s    en 0s      

2019-07-05 11:31:57 (70,1 MB/s) - “omniscidb_geopandas_output.py” guardado [1874/1874]

In [5]:
# workaround to use geopandas output
from omniscidb_geopandas_output import enable_geopandas_output 
enable_geopandas_output(omniscidb_cli)
In [6]:
t = omniscidb_cli.table('zipcodes_2017')
display(t)
In [7]:
print('# rows:', t.count().execute())
# rows: 33144

This table has ~33k rows. For this example, let's use just the first 1k rows.

In [8]:
expr = t[t.omnisci_geo].head(1000)
df = expr.execute()

Instead of getting all rows from the database and get from that the first 1000 rows, Ibis will prepare a SQL statement to get just the first 1000 rows! So it reduces the memory consuming to just the data you need!

This is what Ibis will request to the database:

In [9]:
print(expr.compile())
SELECT "omnisci_geo"
FROM zipcodes_2017
LIMIT 1000

Of course geospatial data reading as text wouldn't be useful, so let's plot the result!

Remember: we are using geopandas here!

In [10]:
# let's add some custom style :)
style_kwds = {
    'linewidth': 2,
    'markersize': 2,
    'facecolor': 'red',
    'edgecolor': 'red'
}

df['omnisci_geo'].iloc[::3].plot(**style_kwds)
plt.show()

Trigonometric operations

Currently the OmniSciDB backend supports the follow trigonometric operations: acos, asin, atan, atan2, cos, cot, sin, tan.

Let's check an example using a sin operation over rowid from zipcodes_2017.

In [11]:
# if you want to use a SQL statement try`sql` method!
t = omniscidb_cli.sql('select rowid from zipcodes_2017')

expr = t[t.rowid, t.rowid.sin().name('rowid_sin')].sort_by('rowid').head(100)
expr.execute().rowid_sin.plot()
plt.show()

Some statistical operations

The OmniSciDB Ibis backend also implements some statistical operations, such as: Correlation (corr), Standard Deviation (stddev), Variance (var) and Covariance (cov).

Let's check a pretty simple example: if there is any correlation in this dataset between per capita income and education.

In [12]:
t = omniscidb_cli.table('demo_vote_clean')
# remove some conflictives fields: 'TYPE', 'NAME', 'COUNTY' field
fields = [name for name in t.schema().names if name not in ('TYPE', 'NAME', 'COUNTY')]
t = t[fields].distinct()
t.PerCapitaIncome.corr(t.Education).execute()
Out[12]:
0.7212061029308654

The result ~0.72 means that Per Capita Income and Education has a positive correlation in this dataset.

Conclusions

Ibis is a cool library that can help you in your data analysis tasks. If you already use pandas, it will be pretty easy to add Ibis in your workflow!

So ...

  • Are you excited to use Ibis? Try it out now!
  • Have you already used Ibis? Reach out to me, ivan.ogasawara@quansight.com, and share your experience!
  • Are you interested in contributing to Ibis? Check the good first issues label on GitHub!
  • Do you want to add new features and want to fund Ibis? Contact us at info@quansight.com
In [ ]:
 

Comments