Birrel

Reputation: 4834

PHP, MySQL, Cron Job - Efficient method to maintain current/live data in large tables?

This is mostly theory, so I apologize if it gets wordy.

Background

The project I'm working on pulls information from other websites (external, not hosted by us). We would like to have as-close-to-live information as possible, so that our users are presented with immediately pertinent information. This means monitoring and updating the table constantly.

It is difficult to show my previous work on this, but I have searched high and low for the last couple of weeks, for "maintaining live data in databases," and "instantly updating database when external changes made," and similar. But all to no avail. I imagine the problem of maintaining up-to-date records is common, so I am unsure why thorough solutions for it seem to be so uncommon.

To keep with the guidelines for SO, I am not looking for opinions, but rather for current best practices and most commonly used/accepted, efficient methods in the industry.

Currently, with a cron job, the best we can do is run a process every minute:

* * * * * cd /home/.../public_html/.../ && /usr/bin/php .../robot.php >/dev/null 2>&1

The thing is, we are pulling data from multiple thousands of other sites (each row is a site), and sometimes an update can take a couple minutes or more. Calling the function only once a minute is not good enough. Ideally, we want near-instant resolution.

Checking if a row needs to be updated is quick. Essentially just your simple hash comparison:

if (hash('sha256', $current) !== hash('sha256', $previous)) {
    // ... update row ...
}
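
To make that check more concrete, it might look something like this (a rough sketch only; the sites table, content_hash column, and connection details are hypothetical placeholders, not our actual schema):

<?php
// Rough sketch: fetch one site's page, hash it, and compare against the stored
// hash to decide whether the (slow) full update is needed.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$row = $pdo->query("SELECT id, url, content_hash FROM sites WHERE id = 1")
           ->fetch(PDO::FETCH_ASSOC);

$current     = file_get_contents($row['url']);  // pull the external page
$currentHash = hash('sha256', $current);        // cheap fingerprint of its content

if ($currentHash !== $row['content_hash']) {
    // content changed since the last pass: do the full row update
    $pdo->prepare("UPDATE sites SET content_hash = :h, updated_at = NOW() WHERE id = :id")
        ->execute([':h' => $currentHash, ':id' => $row['id']]);
}
?>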

Using processes fired exclusively by the cron job means that if a row ends up getting updated, the process is held-up until it is done, or until the cron job fires a new process a minute later.

No bueno! Pas bien! If, by some horrible twist of fate, every row needed to be updated, then it could potentially take hours (or longer) before all records are current. And in that time, rows that had already been passed over would be out of date.

Note: The DB is set up in such a way that rows currently being updated are inaccessible to new processes. The function essentially crawls down the table, finds the next available row that has not been read/updated, and dives in. Once finished with the update, it continues down to the next available row.

Each process is killed when it reaches the end of the table, or when all the rows in the table are marked as read. At this point, all rows are reset to unread, and the process starts over.
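
For illustration, the "find the next available row and claim it" step could look roughly like this (hypothetical schema: a sites table with status and claimed_by columns; the single UPDATE claims a row atomically, so two processes cannot grab the same one):

<?php
// Rough sketch of claiming the next unread row (status 0 = unread,
// 1 = in progress, 2 = done). Schema and names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

// atomically claim the next unread row for this process
$pdo->prepare("UPDATE sites SET status = 1, claimed_by = :pid
               WHERE status = 0 ORDER BY id LIMIT 1")
    ->execute([':pid' => $pid]);

// fetch the row we just claimed, if there was one left
$stmt = $pdo->prepare("SELECT * FROM sites WHERE status = 1 AND claimed_by = :pid LIMIT 1");
$stmt->execute([':pid' => $pid]);

if ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // ... slow part: fetch the external site and update the row ...
    $pdo->prepare("UPDATE sites SET status = 2 WHERE id = :id")
        ->execute([':id' => $row['id']]);
}
?>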

With the amount of data being collected, the only way to improve resolution is to have multiple processes running at once.

But how many is too many?

Possible Solution (method)

The best method I've come up with so far, to get through all rows as quickly as possible, is this:

  1. Cron Job calls first process (P1)

  2. P1 skims the table until it finds a row that is unread and requires updating, and dives in

  3. As soon as P1 enters the row, it calls a second identical process (P2) to continue from that point

  4. P2 skims the table until it finds a row that is unread and requires updating, and dives in

  5. As soon as P2 enters the row, it calls a third identical process (P3) to continue from that point

... and so on.

Essentially, every time a process enters a row to update it, a new process is called to continue on.

BUT... the parent processes are not dead. This means that as soon as they are finished with their updates, they begin to crawl the table again, looking for the next available row.

AND... on top of this all, a new cron job is still fired every minute.

What this means is that potentially thousands of identical processes could be running at the same time. The number of processes cannot exceed the number of records in the table. Worst-case scenario is that every row is being updated simultaneously, and a cron job or two are fired before any updates are finished. The cron jobs will immediately die, since no rows are available to update. As each process finishes with its updates, it would also immediately die for the same reason.

The scenario above is worst-case. It is unlikely that more than 5 or 10 rows will ever need to be updated each pass, but theoretically it is possible to have every row being updated simultaneously.
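
Sketched out, each such process (robot.php) would look roughly like this (again, the schema, paths, and spawning mechanism are placeholders; the point is only the claim-spawn-update-repeat shape):

<?php
// Rough sketch of one chained process: claim a row, immediately fire the next
// identical process in the background, do the slow update, then loop.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

while (true) {
    // atomically claim the next unread row (same pattern as above)
    $pdo->prepare("UPDATE sites SET status = 1, claimed_by = :pid
                   WHERE status = 0 ORDER BY id LIMIT 1")
        ->execute([':pid' => $pid]);

    $stmt = $pdo->prepare("SELECT * FROM sites WHERE status = 1 AND claimed_by = :pid LIMIT 1");
    $stmt->execute([':pid' => $pid]);

    if (!($row = $stmt->fetch(PDO::FETCH_ASSOC))) {
        break; // nothing left to claim: this process dies
    }

    // fire the next identical process in the background before the slow work starts
    exec('/usr/bin/php ' . __DIR__ . '/robot.php > /dev/null 2>&1 &');

    // ... slow part: fetch the external site and update the row ...
    $pdo->prepare("UPDATE sites SET status = 2 WHERE id = :id")
        ->execute([':id' => $row['id']]);
}
?>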

Possible Improvements (primarily on resources, not speed or resolution)

  1. Monitor and limit the number of live processes allowed, and kill any new ones that are fired. But then this raises questions like "how many is too many?" and "what is the minimum number required to achieve a certain resolution?" (A rough sketch of such a guard follows this list.)

  2. Have each process mark multiple rows at a time (5-10), and not continue until all rows in the set have been dealt with. This would have the effect of decreasing the maximum number of simultaneous processes by a factor of however many rows get marked at a time.
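
For improvement 1, the guard could be as simple as counting how many copies of the script are already running before doing any work (a rough sketch; MAX_WORKERS is an arbitrary number, and pgrep is assumed to be available on the host):

<?php
// Rough sketch: refuse to start if enough robot.php processes already exist.
const MAX_WORKERS = 20; // arbitrary cap, would need tuning

$out     = shell_exec('pgrep -fc robot.php'); // count matching processes (includes this one)
$running = (int) $out;

if ($running > MAX_WORKERS) {
    exit; // plenty of workers already, so this one dies immediately
}

// ... otherwise continue to claim and update rows as usual ...
?>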

Like I said at the beginning, surely this is a common problem for database architects. Is there a better/faster/more efficient method than what I've laid out, for maintaining current records?

Thanks for keeping with me!

Upvotes: 2

Views: 1109

Answers (1)

Webeng

Reputation: 7113

First of all, I read it all! Just had to pat myself on the back for that :)

What you are probably looking for is a worker queue. A queue is basically a line like the one you would find in a supermarket, and a worker is the woman at the counter receiving the money and serving each customer. When there are no customers, she does no work; when there are, she does.

When there are a lot of customers in the mall, more workers open up the empty counters, and the people buying groceries get distributed amongst all of them.

I have written a lot about queues recently, and the one I most recommend is Beanstalk. It's simple to use, and it has a PHP client library, Pheanstalk, if you are planning to create queues and workers in PHP (and from there control what happens in your MySQL database).

An example of how a queue script and a worker script would look is similar to the following (obviously you would add your own code to adapt it to your specific needs, and you would generate as many workers as you want; you could even vary the number of workers depending on how much demand there is in your queue):

Adding jobs to the queue

<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
  ->useTube("my_queue")
  ->put("UPDATE mytable SET price = price + 4 WHERE stock = GOOG");//sql query for instance
?>

From your description, it seems you are using transactions, which prevents some updates from taking place while others are being applied. This is actually a great reason to use a queue, because if a queued job times out, it is released back into the queue (at least in the Beanstalk queue I am describing), which means it won't be lost in the event of a timeout.
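
For the timeout side of that, put() also takes a priority, a delay, and a TTR ("time to run"); if a worker holds a job longer than the TTR without deleting it, beanstalkd releases the job back to the ready queue. Roughly (the numbers here are only illustrative, and defaults may differ between Pheanstalk versions):

<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
  ->useTube("my_queue")
  ->put(
      "UPDATE mytable SET price = price + 4 WHERE stock = 'GOOG'",
      1024, // priority (lower numbers are reserved first; 1024 is the usual default)
      0,    // delay in seconds before the job becomes ready
      120   // TTR: seconds a worker may hold the job before it is released back
  );
?>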

Worker script:

<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');

if ($job = $pheanstalk
    ->watch('my_queue')
    ->ignore('default')
    ->reserve()) // retrieves a job if there is one in the queue
{
    echo $job->getData(); // instead of echoing, you would execute your query at this point

    $pheanstalk->delete($job); // deletes the job from the queue
}
?>

You would still have to make some design decisions, like how many workers to run. You might put one worker in a while loop, obtaining all the jobs and executing them one by one, and then call other worker scripts to help if you see that, say, you have executed 3 jobs and more are still coming in. There are many ways of managing the queue, but this is what is often used in situations like the one you described.
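
As a rough sketch (assuming the job payload is the SQL string itself and that a PDO connection is available), such a looping worker might look like:

<?php
// Rough sketch of a long-running worker: keep reserving jobs and executing
// them one by one instead of exiting after a single job.
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pdo        = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // assumed DB handle

while ($job = $pheanstalk->watch('my_queue')->ignore('default')->reserve()) {
    try {
        $pdo->exec($job->getData());   // run the queued SQL (or whatever work the job encodes)
        $pheanstalk->delete($job);     // done: remove the job from the queue
    } catch (Exception $e) {
        $pheanstalk->bury($job);       // failed: bury the job for later inspection
    }
}
?>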

Another great benefit of queues from a library as widely recommended as Pheanstalk is that they are very versatile. If in the future you decide you want to organize your workers differently, you can do so easily, and there are many functions that make your job easier. No reason to reinvent the wheel.

Upvotes: 1
