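How a crawler can get past Cloudflare's DDoS protection check

Sites protected by Cloudflare make you wait through a 5-second check on the first visit, which tests whether you are visiting through a normal browser, as shown in the figure below:

This post explains how to get past this check by technical means. I tried two approaches, and both worked.

1. Use a third-party Python library such as https://github.com/VeNoMouS/cloudscraper

It is very simple to use, and the official documentation covers the details. Example: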

import cloudscraper
scraper = cloudscraper.create_scraper()
res = scraper.get("http://xxx")
print(res.content)

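This library parses and solves Cloudflare's challenge logic in native Python code. It can also be configured to use an external interpreter such as Node.js to solve the challenge; see the official documentation for details.

The library has one drawback: if Cloudflare changes its algorithm, even slightly, the library stops working, and you can only wait for the author to update the code. That leaves you in a rather passive position.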
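Because the library can stop working without warning, it helps to check whether a response still looks like the check page before trusting it. Below is a minimal sketch of such a check. The function name and the `challenge_statuses` parameter are hypothetical, and the heuristic (status 503, the code handled later in this post, plus a `Server: cloudflare` header) is an assumption that may not match every kind of challenge.

```python
def looks_like_cf_challenge(status_code, headers, challenge_statuses=(503,)):
    """Heuristic: True if a response looks like Cloudflare's interstitial check."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "")
    # Assumption: the check page answers with one of challenge_statuses and
    # identifies itself as "cloudflare" in the Server header.
    return status_code in challenge_statuses and "cloudflare" in server.lower()
```

With cloudscraper, which returns requests-style responses, this could be called as `looks_like_cf_challenge(res.status_code, res.headers)`; a True result would be the signal to switch to the second approach.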

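2. Use Splash to fetch the page

Splash (https://splash.readthedocs.io/) is a scriptable headless browser, exposed as a JavaScript rendering service with an HTTP API. Rather than solving the challenge in our own program as above, it is simpler to let a real browser visit the protected page.

Once the Cloudflare check passes, two cookies are set. As long as later requests keep sending these cookies, no further check is needed. So my approach is: when a check is required, visit the page through Splash, save the returned cookies, headers and other necessary information, and attach them to the following requests, which then go through normally.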
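The cookie-reuse step can be sketched on its own with the standard library. The helper name is hypothetical, and the cookie names and values in the usage note are placeholders for illustration. Each Set-Cookie line is parsed separately, on the assumption that the caller has access to the individual header lines.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_lines):
    """Build a Cookie request-header value from Set-Cookie response lines."""
    jar = SimpleCookie()
    for line in set_cookie_lines:
        # load() keeps the name/value pair and stores Path, HttpOnly, etc.
        # as attributes of the morsel, which we do not send back.
        jar.load(line)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

For example, `cookie_header_from_set_cookie(["__cfduid=d41d8; Path=/; HttpOnly", "cf_clearance=abc123; Path=/; Secure"])` yields a single string that can be sent as the `Cookie` header of later requests.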

The full example code is as follows:


import datetime
import json
import urllib.parse
from http.cookies import SimpleCookie

import requests

requests_timeout = 15


class ConfigProxy:
    # Stand-in for the author's config object, which the post does not show.
    # Assumes a local Splash instance on its default port; adjust as needed.
    splash_url = "http://localhost:8050"


def log(msg):
    print(f"[{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {msg}", flush=True)


class Scraper:
    # Lua script run by Splash: for each response, record the request info
    # and response headers, then abort so the body is never downloaded.
    splash_lua_script = '''
        local treat = require("treat")
        local base64 = require("base64")
        local res = {}
        splash.response_body_enabled = true
        splash.request_body_enabled = true
        splash:on_response(function ( response )
            res['url'] = treat.as_string(response.url)
            res['cookies'] = response.request.info['cookies']
            res['set-cookie'] = response.headers["set-cookie"]
            res['method'] = response.request.method
            res['info'] = response.request.info
            response.abort()
        end)
        splash:go(splash.args.url)
        -- Wait a little longer than Cloudflare's 5-second check.
        splash:wait(5.5)
        return res
    '''

    def __init__(self):
        self.session = requests.session()
        # Headers (including the Cookie) reused once the check has passed.
        self.headers = {}

    def splash_request(self, url):
        """Let Splash pass the check, then replay the final request ourselves."""
        params = {
            "url": url,
            "lua_source": self.splash_lua_script,
        }
        headers = {
            "Content-Type": "application/json"
        }
        res = self.session.post(urllib.parse.urljoin(ConfigProxy.splash_url, "/run"), headers=headers,
                                data=json.dumps(params), timeout=requests_timeout)
        rdata = res.json()
        # The request info lists headers as name/value pairs; flatten them to a dict.
        cf_headers = {}
        for header in rdata['info']['headers']:
            cf_headers[header['name']] = header['value']
        if 'postData' not in rdata['info']:
            log("Warning: postData not in info dict")
            return None
        postdata = rdata['info']['postData']['text']
        url = rdata['info']['url']
        # Replay the request Splash recorded, to receive the cookies ourselves.
        res = self.session.post(url, headers=cf_headers, data=postdata, timeout=requests_timeout, allow_redirects=False)
        set_cookie = res.headers.get('set-cookie')
        if not set_cookie:
            log("Warning: no set-cookie header in response")
            return None
        cookie = SimpleCookie()
        cookie.load(set_cookie)
        cookie_str = "; ".join(f"{k}={v.value}" for k, v in cookie.items())
        self.headers = {
            "Referer": "https://xxx.com",
            "User-Agent": cf_headers['User-Agent'],
            "Cookie": cookie_str,
        }
        return res

    def request(self, url):
        # No saved cookies yet: go through Splash first.
        if not self.headers:
            return self.splash_request(url)
        res = self.session.get(url, headers=self.headers, timeout=requests_timeout, allow_redirects=False)
        if res.status_code == 503:
            # 503 means the check is required again, so the saved cookies are stale.
            log("Get 503 response, back to splash_request...")
            return self.splash_request(url)
        else:
            return res


if __name__ == '__main__':
    scraper = Scraper()
    url = 'xxx'
    res = scraper.request(url)
    if res is None:
        log("Get res is None")
    elif res.status_code == 200:
        log('success')
    else:
        log(f"Get {url} , status={res.status_code}")

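Here I use a Splash Lua script. Splash cannot render binary pages such as PDFs; it can only return ordinary HTML pages. That rules out splash:html(), and the binary data cannot be read from response.body inside the splash:on_response callback either: when Splash fails to render the page, it simply does not populate response.body. Setting splash.response_body_enabled or calling request:enable_response_body makes no difference; response.body is still unavailable.

So once Splash has received the response headers, I have it give up on reading the body. That is the purpose of this part of the Lua script above: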

splash:on_response(function ( response )
    res['url'] = treat.as_string(response.url)
    res['cookies'] = response.request.info['cookies']
    res['set-cookie'] = response.headers["set-cookie"]
    res['method'] = response.request.method
    res['info'] = response.request.info
    response.abort()
end)
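The `info` table saved by this callback describes the request in HAR format: headers arrive as a list of name/value pairs and a POST body sits under `postData.text`. As a rough sketch of how the Python side can unpack it, the helper below (a hypothetical name, mirroring the fields the example code reads) returns the pieces needed to replay the request, or None when no POST body was recorded.

```python
def replay_args_from_info(info):
    """Extract (url, headers, body) from a HAR-style request info dict."""
    if "postData" not in info:
        # Nothing was posted, so there is no request to replay.
        return None
    # Flatten [{"name": ..., "value": ...}, ...] into a plain dict.
    headers = {h["name"]: h["value"] for h in info["headers"]}
    return info["url"], headers, info["postData"]["text"]
```

The returned tuple maps directly onto the `url`, `headers` and `data` arguments of a requests call.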

I then take the returned cookies and other header information and download the body myself with requests.
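A minimal sketch of that last download step, assuming the saved headers are already at hand. The function name, `dest_path` and the chunk size are illustrative choices. `session` is expected to behave like a `requests.Session`; it is passed in rather than created here so the helper stays independent of how the cookies were obtained.

```python
def download_body(session, url, headers, dest_path, timeout=15):
    """Fetch url with the saved headers and write the raw body to dest_path."""
    # stream=True avoids loading a large binary (e.g. a PDF) into memory at once.
    res = session.get(url, headers=headers, timeout=timeout,
                      stream=True, allow_redirects=False)
    if res.status_code != 200:
        # For example a 503, which means the check has to be passed again.
        return False
    with open(dest_path, "wb") as fh:
        for chunk in res.iter_content(chunk_size=8192):
            fh.write(chunk)
    return True
```

With the class from the example code, a call might look like `download_body(scraper.session, url, scraper.headers, "out.pdf")`.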

Kyle's Blog
