之前做 SQL 血缘解析时,也调研了当下开源的方案,解析 SQL 语句的倒是一大把,解析表级血缘的也有,但是到了字段级,我只看到了 Python 实现的 https://github.com/reata/sqllineage 这个项目。
于是我自己去研究了下这个库,发现有些问题,不能满足我的需要:
- 它底层用到的是 https://github.com/andialbrecht/sqlparse 这个库来解析 SQL AST 语法树
- 而 sqlparse 这个解析器,它是泛解析,并不严格校验 SQL 语法的正确性
- 如果不严格区分语法,那么面对不同的 SQL 方言时,可能会出差错
在企业级系统中,没有解析到血缘还好,一旦解析出来了,就务必正确,不能搞个错误的结果出来,这是底线。
所以我的研发路线与开源项目不同,我要先严格校验 SQL 语法,并且解析出 AST 语法树之后,再来提取字段级血缘。当然表级血缘自然也有了。
在线 Demo
基于我的研发成果,做了个在线体验,地址是 https://kyle.ai/sql ,这个地址只支持解析 Hive SQL 类型,实际上我实现了好几种 SQL 类型的字段级血缘:
- Flink SQL
- Hive SQL
- Spark SQL
- MySQL
- ClickHouse SQL
示例
以这个测试 SQL 为例
WITH
reg AS (
SELECT
reg_channel,
reg_subchannel,
reg_date,
COUNT(DISTINCT uid) AS reg_num
FROM
info_detail
WHERE
reg_date >= '2023-01-01'
GROUP BY
reg_channel,
reg_subchannel,
reg_date
),
setup AS (
SELECT
reg_channel,
reg_subchannel,
kaihu_date,
COUNT(DISTINCT uid) AS setup_num
FROM
info_detail
WHERE
kaihu_date >= '2023-01-01'
GROUP BY
reg_channel,
reg_subchannel
),
pc AS (
SELECT
reg_channel,
reg_subchannel,
first_money_in_date AS first_pc_date,
COUNT(DISTINCT uid) AS pc_num
FROM
info_detail
WHERE
first_money_in_date >= '2023-01-01'
GROUP BY
reg_channel,
reg_subchannel,
first_money_in_date
),
front AS (
SELECT
ad_date,
reg_channel,
sub_channel,
cost,
impression,
click,
install_usr
FROM
ad_full_line
WHERE
ad_date >= '2023-01-01'
AND (
(cost IS NOT NULL)
OR (impression IS NOT NULL)
OR (click IS NOT NULL)
OR (install_usr IS NOT NULL)
)
),
join_all AS (
SELECT
COALESCE(
stat_date,
reg_date,
kaihu_date,
first_pc_date,
ad_date
) AS stat_date,
COALESCE(
a.reg_channel,
b.reg_channel,
c.reg_channel,
d.reg_channel,
e.reg_channel
) AS reg_channel,
COALESCE(
a.reg_subchannel,
b.reg_subchannel,
c.reg_subchannel,
d.reg_subchannel,
e.sub_channel
) AS reg_subchannel,
nvl(cost, 0) AS cost,
nvl(impression, 0) AS impression,
nvl(click, 0) AS click,
nvl(reg_num, 0) AS reg_num,
nvl(setup_num, 0) AS setup_num,
nvl(pc_num, 0) AS pc_num,
nvl(aum, 0) AS aum
FROM
day_stat a
FULL OUTER JOIN reg b ON stat_date = reg_date
AND a.reg_channel = b.reg_channel
AND a.reg_subchannel = b.reg_subchannel
FULL OUTER JOIN setup c ON stat_date = kaihu_date
AND a.reg_channel = c.reg_channel
AND a.reg_subchannel = c.reg_subchannel
FULL OUTER JOIN pc d ON stat_date = first_pc_date
AND a.reg_channel = d.reg_channel
AND a.reg_subchannel = d.reg_subchannel
FULL OUTER JOIN front e ON stat_date = ad_date
AND a.reg_channel = e.reg_channel
AND a.reg_subchannel = e.sub_channel
),
join_all2 AS (
SELECT
stat_date,
reg_channel,
reg_subchannel,
MAX(cost) AS cost,
MAX(impression) AS impression,
MAX(click) AS click,
MAX(reg_num) AS reg_num,
MAX(setup_num) AS setup_num,
MAX(pc_num) AS pc_num,
MAX(aum) AS aum
FROM
join_all
GROUP BY
stat_date,
reg_channel,
reg_subchannel
) INSERT overwrite TABLE ad_conver_day
SELECT
'hk' AS area,
stat_date,
ad_platform,
account_id,
campaign_id,
campaign_name,
group_id,
group_name,
a.reg_channel,
a.reg_subchannel,
cost,
impression,
click,
reg_num,
setup_num,
pc_num,
aum
FROM
join_all2 a
JOIN dimension_table b ON a.reg_channel = b.reg_channel
AND a.reg_subchannel = b.sub_channel
解析的字段级血缘结果,用图像展示如下
其中虚线的是中间过程表。
因为我的实现需要先严格解析语法树,所以当 SQL 不合法时,程序会直接报错:
Parse sql syntax error, line: 5, message: no viable alternative at input 'with\r\n reg as (\r\n select\r\n reg_channel,\r\n reg_subchannel,,'
这点开源实现会友好一些,他们不会报错,跳过错误继续解析。
这里演示另一个案例,表别名冲突的情况
INSERT INTO
us_option_qty_call_max_0
SELECT
a.stock_code AS stock_code,
a.short_qty AS short_qty
FROM
(
SELECT
stock_code,
short_qty
FROM
(
SELECT
stock_code,
short_qty
FROM
(
SELECT
stock_code,
short_qty,
short_qty + long_qty AS total_qty
FROM
futu_data_inter.us_option_qty_call_daily
) a
) a
) a
上面的 SQL 中,查询别名都叫 a ,命名冲突,这时我的程序会自动给它重命名,解析如下
还有单从 SQL 语句中,无法推断字段从哪个表来的,这时候输出的血缘表名,会是多个表的组合,例如
INSERT INTO
query06
SELECT
a.ca_state state,
cnt
FROM
customer_address a,
customer c,
date_dim d
解析结果为
你也可以在请求接口中,带上你的数据库表结构定义,这时候我的解析程序就能正确推断出 cnt 字段是从哪个表来的了。
OK,更多的细节就不啰嗦了,感兴趣可以前往上面的 Demo 地址体验。欢迎留言交流。
另外,源代码暂时无法开放出来。