Reputation: 987
I am currently working on an application that harvests information about greyhounds from a racing website and performs a number of calculations on the data. The current application displays the data and performs the calculations correctly by issuing individual YQL requests to the racing website based on the user's input.
However, I've found that due to the large number of HTTP calls and the lack of data caching, the application tends to be a bit slow. To speed it up and open up the ability to analyse the data further, I would like to build some sort of system that scrapes and stores all the data relevant to a given day on the night before, via a cron job, but I'm unsure how to go about it.
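For reference, the nightly trigger itself would just be a crontab entry along these lines (the script path and the 2 a.m. schedule are only placeholders):

```
# Run the harvester at 02:00 every night (script path is hypothetical)
0 2 * * * /usr/bin/php /path/to/harvest.php >> /var/log/harvest.log 2>&1
```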
At the moment, the application goes through the following rough process:
As you can see, there are quite a few separate HTTP requests. This is unavoidable, as each dataset lives on a different page of the racing website. For that reason, I would much rather get the bulk of the processing out of the way through a separate system and have the data stored in a database, rather than harvested and processed each time a user makes a request.
I could easily pull the harvesting and calculation logic out of the current system and run it from a cron job, but it would all run within a single PHP request. That means the server would have to iterate over literally thousands of pieces of data, storing each set in a database, all within one request. Having not tried it, I would assume the request would time out?
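One thing worth knowing about the timeout concern: when PHP is invoked from the command line, as cron would do, max_execution_time defaults to 0 (unlimited), so the usual web-request timeout doesn't apply. A rough sketch of such a script follows, where the URL list, parseAndCalculate(), and the race_stats table are all hypothetical stand-ins for the app's real pieces:

```php
<?php
// harvest.php - run nightly from cron via the PHP CLI.
// The CLI SAPI defaults to no execution time limit, but set it
// explicitly in case the script ever runs under another SAPI.
set_time_limit(0);

$pdo = new PDO('mysql:host=localhost;dbname=racing', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO race_stats (race_url, stats, fetched_at) VALUES (?, ?, NOW())'
);

// Hypothetical list of pages; in practice built from tomorrow's race cards.
$urls = ['http://example.com/race/1', 'http://example.com/race/2'];

foreach ($urls as $url) {
    $html  = file_get_contents($url);  // or cURL, with proper error handling
    $stats = parseAndCalculate($html); // hypothetical: your existing logic

    $stmt->execute([$url, json_encode($stats)]);

    sleep(1); // be polite to the remote site between requests
}
```

If memory growth over thousands of pages becomes a problem, the URL list can be split into batches with one cron invocation per batch.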
So to sum up, here are my questions:
Many thanks,
Dan
Upvotes: 2
Views: 1400
Reputation: 53921
Instead of mass crawling the site, how about on-demand caching?
This is probably easier to implement, and it won't make the racing site suspicious if its TOS doesn't allow crawling (it probably doesn't).
You just need a local SQL table that is keyed by date and has columns for the statistics you are already outputting.
Your flow would go something like this:

1. The user requests statistics for a given date.
2. Look that date up in the local table.
3. On a hit, return the stored statistics straight away.
4. On a miss, fetch the relevant pages, run your existing calculations, store the results keyed by that date, and return them.
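A minimal sketch of that flow in PHP, assuming a PDO connection and a stats_cache table keyed by date (a race_date column plus a serialized stats column); harvestAndCalculate() stands in for the scraping and calculation logic the app already has:

```php
<?php
// On-demand caching: compute stats for a date only once, then serve
// them from the local table. Table and helper names are hypothetical.

function getStatsForDate(PDO $pdo, string $date): array
{
    // 1-2. Look for a cached row for this date.
    $stmt = $pdo->prepare('SELECT stats FROM stats_cache WHERE race_date = ?');
    $stmt->execute([$date]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row !== false) {
        return json_decode($row['stats'], true); // 3. Cache hit
    }

    // 4. Cache miss: scrape and calculate as the app already does.
    $stats = harvestAndCalculate($date); // hypothetical existing logic

    $insert = $pdo->prepare(
        'INSERT INTO stats_cache (race_date, stats) VALUES (?, ?)'
    );
    $insert->execute([$date, json_encode($stats)]);

    return $stats;
}
```

The same table could later be warmed by a nightly cron job without changing the read path at all.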
Upvotes: 1