Birrel

Reputation: 4834

PHP, MySQL, Cron Job - Efficient method to maintain current/live data in large tables?

This is mostly theory, so I apologize if it gets wordy.

Background

The project I'm working on pulls information from other websites (external, not hosted by us). We would like to have as-close-to-live information as possible, so that our users are presented with immediately pertinent information. This means monitoring and updating the table constantly.

It is difficult to show my previous work on this, but I have searched high and low for the last couple of weeks, for "maintaining live data in databases," and "instantly updating database when external changes made," and similar. But all to no avail. I imagine the problem of maintaining up-to-date records is common, so I am unsure why thorough solutions for it seem to be so uncommon.

To keep with the guidelines for SO, I am not looking for opinions, but rather for current best practices and most commonly used/accepted, efficient methods in the industry.

Currently, with a cron job, the best we can do is run a process every minute:

* * * * * cd /home/.../public_html/.../ && /usr/bin/php .../robot.php >/dev/null 2>&1

The thing is, we are pulling data from multiple thousands of other sites (each row is a site), and sometimes an update can take a couple minutes or more. Calling the function only once a minute is not good enough. Ideally, we want near-instant resolution.

Checking if a row needs to be updated is quick. Essentially just your simple hash comparison:

if (hash('sha256', $current) !== hash('sha256', $previous)) {
    // ... update row ...
}
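
To make that check more concrete, it might look something like this (a rough sketch only; the sites table, content_hash column, and connection details are hypothetical placeholders, not our actual schema):

<?php
// Rough sketch: fetch one site's page, hash it, and compare against the stored
// hash to decide whether the (slow) full update is needed.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$row = $pdo->query("SELECT id, url, content_hash FROM sites WHERE id = 1")
           ->fetch(PDO::FETCH_ASSOC);

$current     = file_get_contents($row['url']);  // pull the external page
$currentHash = hash('sha256', $current);        // cheap fingerprint of its content

if ($currentHash !== $row['content_hash']) {
    // content changed since the last pass: do the full row update
    $pdo->prepare("UPDATE sites SET content_hash = :h, updated_at = NOW() WHERE id = :id")
        ->execute([':h' => $currentHash, ':id' => $row['id']]);
}
?>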

Using processes fired exclusively by the cron job means that if a row ends up getting updated, the process is held-up until it is done, or until the cron job fires a new process a minute later.

No bueno! Pas bien! If, by some horrible twist of fate, every row needed to be updated, then it could potentially take hours (or longer) before all records are current. And in that time, rows that had already been passed over would be out of date.

Note: The DB is set up in such a way that rows currently being updated are inaccessible to new processes. The function essentially crawls down the table, finds the next available row that has not been read/updated, and dives in. Once finished with the update, it continues down to the next available row.

Each process is killed when it reaches the end of the table, or when all the rows in the table are marked as read. At this point, all rows are reset to unread, and the process starts over.
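
For illustration, the "find the next available row and claim it" step could look roughly like this (hypothetical schema: a sites table with status and claimed_by columns; the single UPDATE claims a row atomically, so two processes cannot grab the same one):

<?php
// Rough sketch of claiming the next unread row (status 0 = unread,
// 1 = in progress, 2 = done). Schema and names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

// atomically claim the next unread row for this process
$pdo->prepare("UPDATE sites SET status = 1, claimed_by = :pid
               WHERE status = 0 ORDER BY id LIMIT 1")
    ->execute([':pid' => $pid]);

// fetch the row we just claimed, if there was one left
$stmt = $pdo->prepare("SELECT * FROM sites WHERE status = 1 AND claimed_by = :pid LIMIT 1");
$stmt->execute([':pid' => $pid]);

if ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // ... slow part: fetch the external site and update the row ...
    $pdo->prepare("UPDATE sites SET status = 2 WHERE id = :id")
        ->execute([':id' => $row['id']]);
}
?>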

With the amount of data being collected, the only way to improve resolution is to have multiple processes running at once.

But how many is too many?

Possible Solution (method)

The best method I've come up with so far, to get through all rows as quickly as possible, is this:

  1. Cron Job calls first process (P1)

  2. P1 skims the table until it finds a row that is unread and requires updating, and dives in

  3. As soon as P1 enters the row, it calls a second identical process (P2) to continue from that point

  4. P2 skims the table until it finds a row that is unread and requires updating, and dives in

  5. As soon as P2 enters the row, it calls a third identical process (P3) to continue from that point

... and so on.

Essentially, every time a process enters a row to update it, a new process is called to continue on.

BUT... the parent processes are not dead. This means that as soon as they are finished with their updates, they begin to crawl the table again, looking for the next available row.

AND... on top of this all, a new cron job is still fired every minute.

What this means is that potentially thousands of identical processes could be running at the same time. The number of processes cannot exceed the number of records in the table. Worst-case scenario is that every row is being updated simultaneously, and a cron job or two are fired before any updates are finished. The cron jobs will immediately die, since no rows are available to update. As each process finishes with its updates, it would also immediately die for the same reason.

The scenario above is worst-case. It is unlikely that more than 5 or 10 rows will ever need to be updated each pass, but theoretically it is possible to have every row being updated simultaneously.
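
Sketched out, each such process (robot.php) would look roughly like this (again, the schema, paths, and spawning mechanism are placeholders; the point is only the claim-spawn-update-repeat shape):

<?php
// Rough sketch of one chained process: claim a row, immediately fire the next
// identical process in the background, do the slow update, then loop.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

while (true) {
    // atomically claim the next unread row (same pattern as above)
    $pdo->prepare("UPDATE sites SET status = 1, claimed_by = :pid
                   WHERE status = 0 ORDER BY id LIMIT 1")
        ->execute([':pid' => $pid]);

    $stmt = $pdo->prepare("SELECT * FROM sites WHERE status = 1 AND claimed_by = :pid LIMIT 1");
    $stmt->execute([':pid' => $pid]);

    if (!($row = $stmt->fetch(PDO::FETCH_ASSOC))) {
        break; // nothing left to claim: this process dies
    }

    // fire the next identical process in the background before the slow work starts
    exec('/usr/bin/php ' . __DIR__ . '/robot.php > /dev/null 2>&1 &');

    // ... slow part: fetch the external site and update the row ...
    $pdo->prepare("UPDATE sites SET status = 2 WHERE id = :id")
        ->execute([':id' => $row['id']]);
}
?>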

Possible Improvements (primarily on resources, not speed or resolution)

  1. Monitor and limit the number of live processes allowed, and kill any new ones that are fired. But then this raises questions like "how many is too many?" and "what is the minimum number required to achieve a certain resolution?" (A rough sketch of such a guard follows this list.)

  2. Have each process mark multiple rows at a time (5-10), and not continue until all rows in the set have been dealt with. This would have the effect of decreasing the maximum number of simultaneous processes by a factor of however many rows get marked at a time.
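
For improvement 1, the guard could be as simple as counting how many copies of the script are already running before doing any work (a rough sketch; MAX_WORKERS is an arbitrary number, and pgrep is assumed to be available on the host):

<?php
// Rough sketch: refuse to start if enough robot.php processes already exist.
const MAX_WORKERS = 20; // arbitrary cap, would need tuning

$out     = shell_exec('pgrep -fc robot.php'); // count matching processes (includes this one)
$running = (int) $out;

if ($running > MAX_WORKERS) {
    exit; // plenty of workers already, so this one dies immediately
}

// ... otherwise continue to claim and update rows as usual ...
?>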

Like I said at the beginning, surely this is a common problem for database architects. Is there a better/faster/more efficient method than what I've laid out, for maintaining current records?

Thanks for keeping with me!

Upvotes: 2

Views: 1109

Answers (1)

Webeng

Reputation: 7113

First of all, I read it all! Just had to pat myself on the back for that :)

What you are probably looking for is a worker queue. A queue is basically a line like the one you would find in a supermarket, and a worker is the woman at the counter receiving the money and serving each customer. When there are no customers, she does no work; when there are, she does.

When there are a lot of customers in the mall, more workers open up the empty counters, and the people buying groceries get distributed amongst all of them.

I have written a lot about queues recently, and the one I most recommend is Beanstalk. It's simple to use, and it has a PHP client library, Pheanstalk, if you are planning to create queues and workers in PHP (and from there control what happens in your MySQL database).

An example of how a queue script and a worker script would look is similar to the following (obviously you would add your own code to adapt it to your specific needs, and you would generate as many workers as you want; you could even vary the number of workers depending on how much demand there is in your queue):

Adding jobs to the queue

<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
  ->useTube("my_queue")
  ->put("UPDATE mytable SET price = price + 4 WHERE stock = GOOG");//sql query for instance
?>

From your description, it seems you are using transactions, which prevents some updates from taking place while others are being applied. This is actually a great reason to use a queue, because if a queued job times out, it is released back into the queue (at least in the Beanstalk queue I am describing), which means it won't be lost in the event of a timeout.
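
For the timeout side of that, put() also takes a priority, a delay, and a TTR ("time to run"); if a worker holds a job longer than the TTR without deleting it, beanstalkd releases the job back to the ready queue. Roughly (the numbers here are only illustrative, and defaults may differ between Pheanstalk versions):

<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pheanstalk
  ->useTube("my_queue")
  ->put(
      "UPDATE mytable SET price = price + 4 WHERE stock = 'GOOG'",
      1024, // priority (lower numbers are reserved first; 1024 is the usual default)
      0,    // delay in seconds before the job becomes ready
      120   // TTR: seconds a worker may hold the job before it is released back
  );
?>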

Worker script:

<?php
$pheanstalk = new Pheanstalk('127.0.0.1:11300');

if ($job = $pheanstalk
    ->watch('my_queue')
    ->ignore('default')
    ->reserve()) // retrieves a job if there is one in the queue
{
    echo $job->getData(); // instead of echoing, you would execute your query at this point

    $pheanstalk->delete($job); // deletes the job from the queue
}
?>

You would still have to make some design decisions, like how many workers to run. You might put one worker in a while loop, obtaining all the jobs and executing them one by one, and then call other worker scripts to help if you see that, say, you have executed 3 jobs and more are still coming in. There are many ways of managing the queue, but this is what is often used in situations like the one you described.
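
As a rough sketch (assuming the job payload is the SQL string itself and that a PDO connection is available), such a looping worker might look like:

<?php
// Rough sketch of a long-running worker: keep reserving jobs and executing
// them one by one instead of exiting after a single job.
$pheanstalk = new Pheanstalk('127.0.0.1:11300');
$pdo        = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // assumed DB handle

while ($job = $pheanstalk->watch('my_queue')->ignore('default')->reserve()) {
    try {
        $pdo->exec($job->getData());   // run the queued SQL (or whatever work the job encodes)
        $pheanstalk->delete($job);     // done: remove the job from the queue
    } catch (Exception $e) {
        $pheanstalk->bury($job);       // failed: bury the job for later inspection
    }
}
?>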

Another great benefit of queues from a library as widely recommended as Pheanstalk is that they are very versatile. If in the future you decide you want to organize your workers differently, you can do so easily, and there are many functions that make your job easier. No reason to reinvent the wheel.

Upvotes: 1
