Reputation: 17402
I want to chain celery tasks in a STANDARD way.
I have a JSON file. Inside that file, there are many hardcoded URLs. I need to scrape those links, plus scrape the links that are found while scraping those links.
Currently, I'm doing it like this:
for each_news_source, news_categories in rss_obj.iteritems():
    for each_category in news_categories:
        category = each_category['category']
        rss_link = each_category['feed']
        json_id = each_category['json']
        try:
            list_of_links = getrsslinks(rss_link)
            for link in list_of_links:
                scrape_link.delay(link, json_id, category)
        except Exception as e:
            print "Invalid url", str(e)
I want something where `getrsslinks` is also a Celery task, and the scraping of the list of URLs returned by `getrsslinks` should also be another Celery task.
It follows this pattern:
hardcodedJSONURL1
  --`getrsslinks` (celery task)
    --scrape link 1 (celery task)
    --scrape link 2 (celery task)
    --scrape link 3 (celery task)
    --scrape link 4 (celery task)
hardcodedJSONURL2
  --`getrsslinks` (celery task)
    --scrape link 1 (celery task)
    --scrape link 2 (celery task)
    --scrape link 3 (celery task)
    --scrape link 4 (celery task)
and so on...
How can I do this?
Upvotes: 0
Views: 2877
Reputation: 4129
Take a look at the subtask options in Celery. In your case, a group should help. You just need to call a group of `scrape_link` subtasks inside `getrsslinks`.
from celery import group

@app.task
def getrsslinks(rsslink, json_id, category):
    # do processing here to build link_list from the feed
    # fan out one scrape_link task per discovered link, as a group
    scrape_jobs = group(scrape_link.s(link, json_id, category) for link in link_list)
    scrape_jobs.apply_async()
    ...
You might want `getrsslinks` to return `scrape_jobs` to make monitoring the jobs easier. Then, when parsing your JSON file, you would call `getrsslinks` like so:
for each_news_source, news_categories in rss_obj.iteritems():
    for each_category in news_categories:
        category = each_category['category']
        rss_link = each_category['feed']
        json_id = each_category['json']
        getrsslinks.delay(rss_link, json_id, category)
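As one way to keep a handle on the group (a minimal sketch, assuming a configured result backend; `GroupResult.save`/`GroupResult.restore` are Celery's helpers for persisting a group's task ids, and `group_id` is an illustrative variable holding the id you stored):

from celery.result import GroupResult

# inside getrsslinks, instead of firing and forgetting:
result = scrape_jobs.apply_async()
result.save()      # persist the group's task ids in the result backend
return result.id

# later, from any process, rebuild the handle and inspect progress:
saved = GroupResult.restore(group_id)
print "completed scrape jobs:", saved.completed_count()
print "all done?", saved.ready()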
Finally, to monitor which links were invalid (since we replaced the try/except block), you need to store all the `getrsslinks` tasks and watch for success or failure. You could use `apply_async` with `link_error` for this.
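For example (a sketch in the Celery 3.x style, where an errback linked via `link_error` is called with the id of the failed task; `log_invalid_url` is an illustrative name, and a result backend must be configured for the lookup):

@app.task(bind=True)
def log_invalid_url(self, uuid):
    # look up the failed getrsslinks task; result.result holds the exception
    result = self.app.AsyncResult(uuid)
    print "Invalid url", result.result

# dispatch getrsslinks with the errback linked:
getrsslinks.apply_async((rss_link, json_id, category),
                        link_error=log_invalid_url.s())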
Upvotes: 1