Reputation: 399
I'm trying to build a public-facing API that collects data by scraping HTML (the content of the pages is what matters, not the pages themselves). I've elected to use Django-Rest-Framework as my backend. My question is: how exactly should I organize the structure of this project so that the Django ORM stores the scraped content and it can then be accessed through Django-Rest-Framework's API?
I've looked into Scrapy, but it seems less focused on scraping page content and more on web crawling. Additionally, it runs as its own standalone project, which conflicts with Django's bootstrapping.
Is my best bet just running cronjobs? That seems inelegant.
Upvotes: 0
Views: 1183
Reputation: 31555
Use Celery to create asynchronous and periodic tasks.
If you need something lightweight for scraping, you can use BeautifulSoup. Here is a tutorial.
Overall, this is what you need to do:

1. Define a Django model for the scraped content.
2. Write a scraping function with BeautifulSoup that saves its results through the Django ORM.
3. Wrap that function in a Celery task and schedule it with Celery beat instead of cron.
4. Expose the model through a Django-Rest-Framework serializer and viewset.
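A minimal sketch of steps 1–3 is below. The `Article` model, the `scraper` app name, the target URL, and the CSS selector are all placeholders you would replace with your own:

```python
# scraper/models.py
from django.db import models


class Article(models.Model):
    url = models.URLField(unique=True)
    title = models.CharField(max_length=500)
    body = models.TextField(blank=True)
    scraped_at = models.DateTimeField(auto_now=True)
```

```python
# scraper/tasks.py
import requests
from bs4 import BeautifulSoup
from celery import shared_task

from .models import Article

TARGET_URL = "https://example.com/articles"  # placeholder URL


@shared_task
def scrape_articles():
    """Fetch the page, parse it with BeautifulSoup, and store the content via the ORM."""
    response = requests.get(TARGET_URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The selector depends entirely on the markup of the page you are scraping.
    for link in soup.select("a.article-link"):
        Article.objects.update_or_create(
            url=link["href"],
            defaults={"title": link.get_text(strip=True)},
        )
```

To run it periodically, register the task with Celery beat (assuming Celery 4+ configured for Django, which uses the `CELERY_BEAT_SCHEDULE` setting):

```python
# settings.py
CELERY_BEAT_SCHEDULE = {
    "scrape-articles-hourly": {
        "task": "scraper.tasks.scrape_articles",
        "schedule": 60 * 60,  # every hour, in seconds
    },
}
```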
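For step 4, a read-only DRF endpoint over the same hypothetical `Article` model could look like this:

```python
# scraper/serializers.py
from rest_framework import serializers

from .models import Article


class ArticleSerializer(serializers.ModelSerializer):
    class Meta:
        model = Article
        fields = ["id", "url", "title", "body", "scraped_at"]
```

```python
# scraper/views.py
from rest_framework import viewsets

from .models import Article
from .serializers import ArticleSerializer


class ArticleViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Article.objects.all()
    serializer_class = ArticleSerializer
```

```python
# urls.py
from rest_framework.routers import DefaultRouter

from scraper.views import ArticleViewSet

router = DefaultRouter()
router.register("articles", ArticleViewSet)
urlpatterns = router.urls
```

The scraper and the API never call each other directly; they only share the model, so Celery can write new rows in the background while DRF serves whatever is currently in the database.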
Upvotes: 1