基于 Golang 解析 SQL 字段级血缘

之前做 SQL 血缘解析时,也调研了当下开源的方案,解析 SQL 语句的倒是一大把,解析表级血缘的也有,但是到了字段级,我只看到了 Python 实现的 https://github.com/reata/sqllineage 这个项目。

于是我自己去研究了下这个库,发现有些问题,不能满足我的需要:

  • 它底层用到的是 https://github.com/andialbrecht/sqlparse 这个库来解析 SQL AST 语法树
  • 而 sqlparse 这个解析器,它是泛解析,并不严格校验 SQL 语法的正确性
  • 如果不严格区分语法,那么面对不同的 SQL 方言时,可能会出差错

在企业级系统中,没有解析到血缘还好,一旦解析出来了,就务必正确,不能搞个错误的结果出来,这是底线。

所以我的研发路线与开源项目不同,我要先严格校验 SQL 语法,并且解析出 AST 语法树之后,再来提取字段级血缘。当然表级血缘自然也有了。

在线 Demo

基于我的研发成果,做了个在线体验,地址是 https://kyle.ai/sql ,这个地址只支持解析 Hive SQL 类型,实际上我实现了好几种 SQL 类型的字段级血缘:

  • Flink SQL
  • Hive SQL
  • Spark SQL
  • MySQL
  • ClickHouse SQL

示例

以这个测试 SQL 为例

WITH
    reg AS (
        SELECT
            reg_channel,
            reg_subchannel,
            reg_date,
            COUNT(DISTINCT uid) AS reg_num
        FROM
            info_detail
        WHERE
            reg_date >= '2023-01-01'
        GROUP BY
            reg_channel,
            reg_subchannel,
            reg_date
    ),
    setup AS (
        SELECT
            reg_channel,
            reg_subchannel,
            kaihu_date,
            COUNT(DISTINCT uid) AS setup_num
        FROM
            info_detail
        WHERE
            kaihu_date >= '2023-01-01'
        GROUP BY
            reg_channel,
            reg_subchannel            
    ),
    pc AS (
        SELECT
            reg_channel,
            reg_subchannel,
            first_money_in_date AS first_pc_date,
            COUNT(DISTINCT uid) AS pc_num
        FROM
            info_detail
        WHERE
            first_money_in_date >= '2023-01-01'
        GROUP BY
            reg_channel,
            reg_subchannel,
            first_money_in_date
    ),
    front AS (
        SELECT
            ad_date,
            reg_channel,
            sub_channel,
            cost,
            impression,
            click,
            install_usr
        FROM
            ad_full_line
        WHERE
            ad_date >= '2023-01-01'
            AND (
                (cost IS NOT NULL)
                OR (impression IS NOT NULL)
                OR (click IS NOT NULL)
                OR (install_usr IS NOT NULL)
            )
    ),
    join_all AS (
        SELECT
            COALESCE(
                stat_date,
                reg_date,
                kaihu_date,
                first_pc_date,
                ad_date
            ) AS stat_date,
            COALESCE(
                a.reg_channel,
                b.reg_channel,
                c.reg_channel,
                d.reg_channel,
                e.reg_channel
            ) AS reg_channel,
            COALESCE(
                a.reg_subchannel,
                b.reg_subchannel,
                c.reg_subchannel,
                d.reg_subchannel,
                e.sub_channel
            ) AS reg_subchannel,
            nvl(cost, 0) AS cost,
            nvl(impression, 0) AS impression,
            nvl(click, 0) AS click,
            nvl(reg_num, 0) AS reg_num,
            nvl(setup_num, 0) AS setup_num,
            nvl(pc_num, 0) AS pc_num,
            nvl(aum, 0) AS aum
        FROM
            day_stat a
            FULL OUTER JOIN reg b ON stat_date = reg_date
            AND a.reg_channel = b.reg_channel
            AND a.reg_subchannel = b.reg_subchannel
            FULL OUTER JOIN setup c ON stat_date = kaihu_date
            AND a.reg_channel = c.reg_channel
            AND a.reg_subchannel = c.reg_subchannel
            FULL OUTER JOIN pc d ON stat_date = first_pc_date
            AND a.reg_channel = d.reg_channel
            AND a.reg_subchannel = d.reg_subchannel
            FULL OUTER JOIN front e ON stat_date = ad_date
            AND a.reg_channel = e.reg_channel
            AND a.reg_subchannel = e.sub_channel
    ),
    join_all2 AS (
        SELECT
            stat_date,
            reg_channel,
            reg_subchannel,
            MAX(cost) AS cost,
            MAX(impression) AS impression,
            MAX(click) AS click,
            MAX(reg_num) AS reg_num,
            MAX(setup_num) AS setup_num,
            MAX(pc_num) AS pc_num,
            MAX(aum) AS aum
        FROM
            join_all
        GROUP BY
            stat_date,
            reg_channel,
            reg_subchannel
    ) INSERT overwrite TABLE ad_conver_day
SELECT
    'hk' AS area,
    stat_date,
    ad_platform,
    account_id,
    campaign_id,
    campaign_name,
    group_id,
    group_name,
    a.reg_channel,
    a.reg_subchannel,
    cost,
    impression,
    click,
    reg_num,
    setup_num,
    pc_num,
    aum
FROM
    join_all2 a
    JOIN dimension_table b ON a.reg_channel = b.reg_channel
    AND a.reg_subchannel = b.sub_channel

解析的字段级血缘结果,用图像展示如下

其中虚线的是中间过程表。

因为我的实现需要先严格解析语法树,所以当 SQL 不合法时,程序会直接报错:

Parse sql syntax error, line: 5, message: no viable alternative at input 'with\r\n reg as (\r\n select\r\n reg_channel,\r\n reg_subchannel,,'

这点开源实现会友好一些,他们不会报错,跳过错误继续解析。

这里演示另一个案例,表别名冲突的情况

INSERT INTO
    us_option_qty_call_max_0
SELECT
    a.stock_code AS stock_code,
    a.short_qty AS short_qty
FROM
    (
        SELECT
            stock_code,
            short_qty
        FROM
            (
                SELECT
                    stock_code,
                    short_qty
                FROM
                    (
                        SELECT
                            stock_code,
                            short_qty,
                            short_qty + long_qty AS total_qty
                        FROM
                            futu_data_inter.us_option_qty_call_daily
                    ) a
            ) a
    ) a

上面的 SQL 中,查询别名都叫 a ,命名冲突,这时我的程序会自动给它重命名,解析如下

还有单从 SQL 语句中,无法推断字段从哪个表来的,这时候输出的血缘表名,会是多个表的组合,例如

INSERT INTO
    query06
SELECT
    a.ca_state state,
    cnt
FROM
    customer_address a,
    customer c,
    date_dim d

解析结果为

你也可以在请求接口中,带上你的数据库表结构定义,这时候我的解析程序就能正确推断出 cnt 字段是从哪个表来的了。

OK,更多的细节就不啰嗦了,感兴趣可以前往上面的 Demo 地址体验。欢迎留言交流。

另外,源代码暂时无法开放出来。

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注