Celery with Redis as broker: a bug where ETA tasks get executed multiple times

We use the Celery framework as an asynchronous task queue, with Redis as Celery's broker, and recently ran into a problem.

Over the weekend we used Celery to run delayed tasks, i.e. tasks with a specified execution time (ETA): for example, on Friday we scheduled push notifications to be sent on Saturday and on Sunday.
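For context, here is a minimal sketch of how such a delayed task is scheduled with an ETA. The task name and the group id are taken from the inspect output further down; the import path and the exact delay are assumptions:

from datetime import datetime, timedelta

# Task name as it appears in the inspect output below; the import path is assumed.
from meila_queue.tasks.pnmessage import push_message_to_group

# apply_async(eta=...) tells Celery when to *execute* the task, not when to enqueue it.
# E.g. on Friday, schedule a push that should run one day later.
push_message_to_group.apply_async(
    args=[691],                                 # group id, as seen in the logs below
    eta=datetime.utcnow() + timedelta(days=1),
)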

When the execution time arrived, we found that the task had been executed multiple times.

On the surface it looked as if each Celery worker had executed the task once, even though Celery had in fact received only one task.

Later, I used this command to observe the scheduled (ETA) tasks:

celery inspect scheduled

and found that a newly created scheduled task would, after a little while, turn into two entries on its own:

kyle@dev-desktop1:~/www/meila_app/deps/meila_queue$ date
2013年 12月 31日 星期二 09:42:44 CST
kyle@dev-desktop1:~/www/meila_app/deps/meila_queue$ env/bin/python run.py inspect scheduled
-> dev-desktop1: OK
    * {u'priority': 6, u'eta': 1388456519.971404, u'request': {u'args': u'[691]', u'time_start': None, u'name': u'meila_queue.tasks.pnmessage.push_message_to_group', u'delivery_info': {u'priority': 0, u'routing_key': u'celery', u'exchange': u'celery'}, u'hostname': u'dev-desktop1', u'acknowledged': False, u'kwargs': u'{}', u'id': u'b754d377-4fe6-4b36-a0ee-f0bffc6f90bc', u'worker_pid': None}}

kyle@dev-desktop1:~/www/meila_app/deps/meila_queue$ date
2013年 12月 31日 星期二 09:54:01 CST
kyle@dev-desktop1:~/www/meila_app/deps/meila_queue$ env/bin/python run.py inspect scheduled
-> dev-desktop1: OK
    * {u'priority': 6, u'eta': 1388456519.971404, u'request': {u'args': u'[691]', u'time_start': None, u'name': u'meila_queue.tasks.pnmessage.push_message_to_group', u'delivery_info': {u'priority': 0, u'routing_key': u'celery', u'exchange': u'celery'}, u'hostname': u'dev-desktop1', u'acknowledged': False, u'kwargs': u'{}', u'id': u'b754d377-4fe6-4b36-a0ee-f0bffc6f90bc', u'worker_pid': None}}
    * {u'priority': 6, u'eta': 1388456519.971404, u'request': {u'args': u'[691]', u'time_start': None, u'name': u'meila_queue.tasks.pnmessage.push_message_to_group', u'delivery_info': {u'priority': 0, u'routing_key': u'celery', u'exchange': u'celery'}, u'hostname': u'dev-desktop1', u'acknowledged': False, u'kwargs': u'{}', u'id': u'b754d377-4fe6-4b36-a0ee-f0bffc6f90bc', u'worker_pid': None}}

An identical task had somehow cloned itself into two copies (this is the result reproduced after setting visibility_timeout to 10 minutes).

Later I checked Celery's Redis broker documentation at http://docs.celeryproject.org/en/latest/getting-started/brokers/redis.html and found this passage:

If a task is not acknowledged within the Visibility Timeout the task will be redelivered to another worker and executed.

This causes problems with ETA/countdown/retry tasks where the time to execute exceeds the visibility timeout; in fact if that happens it will be executed again, and again in a loop.

So you have to increase the visibility timeout to match the time of the longest ETA you are planning to use.

Note that Celery will redeliver messages at worker shutdown, so having a long visibility timeout will only delay the redelivery of ‘lost’ tasks in the event of a power failure or forcefully terminated workers.

Periodic tasks will not be affected by the visibility timeout, as this is a concept separate from ETA/countdown.

You can increase this timeout by configuring a transport option with the same name:

BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 43200}

The value must be an int describing the number of seconds.

In other words, when we create a task whose ETA is further away than the visibility_timeout, then every time a visibility_timeout interval passes, Celery decides the task was never successfully executed and redelivers it to another worker. With a 10-minute timeout and an ETA a day away, for instance, the message keeps getting redelivered until several workers each hold their own copy, and all of them fire when the ETA arrives. Celery simply does not handle this well; the scheduled-task case was not taken into account.

The fix, as the documentation says, is to raise the visibility_timeout parameter so that it is larger than the longest ETA delay we use. But that has side effects too: as the note above says, a long visibility timeout also delays the redelivery of genuinely lost tasks.
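Concretely, this is just the transport option quoted from the docs above; a minimal sketch of our broker config with it raised to 2 days (the surrounding settings layout is an assumption):

# celeryconfig.py (sketch)
BROKER_URL = 'redis://localhost:6379/0'

# Must be larger than the longest ETA delay we plan to use; here 2 days, in seconds.
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 2 * 24 * 60 * 60}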

The Celery project itself is positioned mainly as a real-time asynchronous queue; its support for this kind of long-delay scheduled execution is not great.

But even after I raised the parameter to 2 days, an hour later the task was cloned just the same...

In the end, with no other option, we switched the broker from Redis to RabbitMQ.
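The switch itself is only a change of the broker URL (the credentials and vhost below are placeholders); with RabbitMQ, delivery is tracked through real AMQP acknowledgements, so the visibility_timeout mechanism no longer applies:

# Before: Redis as the broker
# BROKER_URL = 'redis://localhost:6379/0'

# After: RabbitMQ as the broker
BROKER_URL = 'amqp://user:password@localhost:5672/myvhost'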