Daniel Greaves

Reputation: 987

Automated data scraping using cron

I am currently working on an application that harvests information about greyhounds from a racing website and performs a number of calculations on that data. The application currently displays the data and performs the calculations correctly, by making individual YQL requests to the racing website based on the user's input.

However, I've found that due to the large number of HTTP calls and the lack of data caching, the application tends to be a bit slow. To speed it up and open up the ability to analyse the data further, I would like to build some sort of system that scrapes and stores all the data relevant to a given day on the night before, via a crontab entry. However, I'm unsure how to go about it.

At the moment, the application goes through the following rough process:

  1. Allow user to select a date
  2. Perform YQL query and iterate through result to get all the races on that date
  3. Allow user to select race from the above list
  4. Perform YQL query and iterate through result to get all the dogs in the race
  5. Perform YQL query and iterate through result to get all the races performed by each dog
  6. Calculate statistics based on the races performed by each dog
  7. Output everything

As you can see, there are quite a few separate HTTP requests. This is unavoidable, as each dataset exists on a different page of the racing website. For that reason, I would much rather get the bulk of the processing out of the way through a separate system and have the data stored in a database, as opposed to being harvested and processed when the user requests it.
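For concreteness, each of those requests is a YQL call over HTTP along these lines (sketch only; the YQL public endpoint is real, but the racing-site URL and XPath are made up):

```php
<?php
// Sketch of one YQL request. The racing-site URL and XPath below are
// placeholders; only the public YQL endpoint itself is real.
$yql = "select * from html where url='http://example-racing-site.com/cards/2012-01-15' and xpath='//table[@class=\"races\"]//tr'";
$url = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=json';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
// Iterate over $data['query']['results'] to pull out each race...
```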

I could easily extract the scraping and calculation logic from the current system and run it from a crontab, but it would all run within a single PHP request. That means the server would have to iterate over literally thousands of pieces of data, storing each set in a database, all within one PHP request. Having not tried it, I would assume that the request would time out?
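For what it's worth, this is roughly what I have in mind (the path is hypothetical); from what I've read, CLI PHP defaults to no execution time limit, but I don't know whether that holds up in practice for a job this size:

```php
<?php
// harvest.php (sketch) -- intended to be run nightly from a crontab entry
// such as (hypothetical path):
//   0 1 * * * /usr/bin/php /path/to/harvest.php
//
// CLI PHP defaults max_execution_time to 0 (no limit), but lifting it
// explicitly can't hurt in case the script is ever run another way.
set_time_limit(0);
ini_set('memory_limit', '512M'); // thousands of records may need headroom

// ...fetch tomorrow's race list via YQL, then the dogs in each race,
// then each dog's past races, storing computed statistics as we go...
```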

So to sum up, here are my questions:

  1. If I placed the processing into a single PHP file and ran it from cron, would it time out before finishing the job, or would it just continue to plough through?
  2. Is there any pre-existing library that deals with such a task?
  3. Any thoughts on alternative ways to accomplish this?

Many thanks,

Dan

Upvotes: 2

Views: 1400

Answers (1)

Byron Whitlock

Reputation: 53921

Instead of mass-crawling the site, how about on-demand caching?

This is probably easier to implement, and it won't raise suspicion at the racing site if their TOS doesn't allow crawling (it probably doesn't).

You just need a local SQL table that is keyed by date and has columns for the statistics you are already outputting.
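Something along these lines would do; the DSN, credentials, and column names here are just placeholders for whatever statistics you actually compute:

```php
<?php
// Sketch of the cache table (DSN/credentials and stat columns are placeholders).
$pdo = new PDO('mysql:host=localhost;dbname=greyhounds', 'user', 'pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS race_stats (
        race_date DATE        NOT NULL,
        race_id   VARCHAR(32) NOT NULL,
        dog_name  VARCHAR(64) NOT NULL,
        avg_time  DECIMAL(6,2),          -- example statistic
        win_rate  DECIMAL(5,2),          -- example statistic
        PRIMARY KEY (race_date, race_id, dog_name)
    )
");
```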

Your flow would go something like this (the cache check and store are sketched after the list):

  1. Allow user to select a date
  2. Do an SQL query to find precomputed data for that date. If the data doesn't exist, go to 3; otherwise go to 9.
  3. Perform YQL query and iterate through result to get all the races on that date
  4. Allow user to select race from the above list
  5. Perform YQL query and iterate through result to get all the dogs in the race
  6. Perform YQL query and iterate through result to get all the races performed by each dog.
  7. Calculate statistics based on the races performed by each dog
  8. Store the statistics in the SQL table, keyed by the date the user entered.
  9. Output everything
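In code, steps 2 and 8 are just a lookup and an insert; roughly (using the sketch table above, with harvestAndComputeStats() standing in for your existing steps 3-7):

```php
<?php
// Step 2: look for precomputed statistics for the chosen date.
$stmt = $pdo->prepare('SELECT * FROM race_stats WHERE race_date = ?');
$stmt->execute(array($date));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if (count($rows) === 0) {
    // Steps 3-7: harvest via YQL and compute as you do now
    // (harvestAndComputeStats() is a placeholder for your existing logic).
    $rows = harvestAndComputeStats($date);

    // Step 8: cache the results, keyed by the date the user entered.
    $insert = $pdo->prepare(
        'INSERT INTO race_stats (race_date, race_id, dog_name, avg_time, win_rate)
         VALUES (?, ?, ?, ?, ?)'
    );
    foreach ($rows as $row) {
        $insert->execute(array($date, $row['race_id'], $row['dog_name'],
                               $row['avg_time'], $row['win_rate']));
    }
}

// Step 9: output $rows as before.
```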

Upvotes: 1
