Ibis: A Python Data Analysis Framework

Ibis in detail

Official site: https://docs.ibis-project.org/index.html

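Ibis is a new Python data analysis framework that bridges the local Python environment (e.g. pandas, scikit-learn) and remote big-data systems (e.g. HDFS, Hive, Impala, Spark). Its goal is to let data scientists and data engineers work with large datasets as productively as they work with small and medium-sized data on a single machine.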

Environment

pip install ibis-framework

Example

Implement the following SQL:

-- ibis limits results to 10000 rows by default
SELECT * FROM d.t LIMIT 10000;

The Ibis implementation:

import ibis
# Connect to Impala (Kerberos authentication)
client = ibis.impala.connect(host='0.0.0.0', port=20050, auth_mechanism="GSSAPI", kerberos_service_name='impala')
# Reference table t in database d
table = client.table('t', database='d')
# Run the query
df = table.execute()
# The result is a pandas.core.frame.DataFrame
df.describe()
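
When only a handful of rows is needed, the row count can be capped explicitly instead of relying on the default limit. Below is a minimal sketch: preview is a hypothetical helper name (not part of ibis), and it assumes the object passed in is an ibis table expression such as the table variable above, whose limit() and execute() methods it calls.

```python
def preview(table, n=100):
    """Return the first `n` rows of a table expression.

    `preview` is a hypothetical helper, not an ibis function. It assumes
    `table` exposes limit() and execute(), as ibis table expressions do.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # limit() only builds a new expression; the query is sent to the
    # backend when execute() is called.
    return table.limit(n).execute()
```

With the connection from the example above, preview(table, 5) would fetch five rows as a DataFrame rather than up to 10000.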

A slightly more complex example

Implement the following SQL:

SELECT count(distinct(id)), pt_dt
FROM d.t
WHERE (pt_dt >= "2018-08-01"
       AND pt_dt <= "2018-08-28")
GROUP BY pt_dt;

The Ibis implementation:

import ibis
client = ibis.impala.connect(host='0.0.0.0', port=20050, auth_mechanism="GSSAPI", kerberos_service_name='impala')

# SELECT
table = client.table('t', database='d')
t = table['id', 'pt_dt']
# WHERE (same date range as the SQL above)
filtered = t.filter([t.pt_dt >= "2018-08-01", t.pt_dt <= "2018-08-28"])
# DISTINCT (count distinct id, the column used in the SQL)
metric = filtered.id.nunique()
# GROUP BY
expr = filtered.group_by('pt_dt').aggregate(unique_id=metric)
# RUN
df = expr.execute()
df.describe()
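
As a quick sanity check of what the SQL above should return, the same query can be run against a small in-memory SQLite table using only the Python standard library. This is a sketch under stated assumptions: SQLite stands in for Impala purely for illustration, the rows are made-up sample data, and distinct_ids_per_day is a hypothetical helper name. Because pt_dt holds ISO-formatted date strings, plain string comparison orders them correctly.

```python
import sqlite3

# Hypothetical sample rows (id, pt_dt); not data from the original post.
ROWS = [
    (1, "2018-07-31"),  # before the date range, excluded
    (1, "2018-08-01"),
    (2, "2018-08-01"),
    (2, "2018-08-01"),  # duplicate id on the same day, counted once
    (3, "2018-08-28"),
    (4, "2018-08-29"),  # after the date range, excluded
]

QUERY = """
SELECT pt_dt, count(DISTINCT id) AS unique_id
FROM t
WHERE pt_dt >= '2018-08-01' AND pt_dt <= '2018-08-28'
GROUP BY pt_dt
ORDER BY pt_dt
"""


def distinct_ids_per_day(rows):
    """Load rows into an in-memory table and return (pt_dt, count) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (id INTEGER, pt_dt TEXT)")
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(distinct_ids_per_day(ROWS))
    # prints [('2018-08-01', 2), ('2018-08-28', 1)]
```

The ibis expression is expected to produce the same per-day counts, returned as a DataFrame with the columns pt_dt and unique_id.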