Jude Fernandes
Jude Fernandes

Reputation: 7527

Create API endpoint for fetching dynamic data based on time

I have a scraper which periodically scrapes articles from news sites and stores them in a database [MYSQL]. The way the scraping works is that the oldest articles are scraped first and then i move onto much more recent articles.

For example an article that was written on the 1st of Jan would be scraped first and given an ID 1 and an article that was scraped on the 2nd of Jan would have an ID 2.

So the recent articles would have a higher id as compared to older articles.

There are multiple scrapers running at the same time.

Now i need an endpoint which i can query based on timestamp of the articles and i also have a limit of 10 articles on each fetch.

The problem arises for example when there are 20 articles which were posted with a timestamp of 1499241705 and when i query the endpoint with a timestamp of 1499241705 a check is made to give me all articles that is >=1499241705 in which case i would always get the same 10 articles each time,changing the condition to a > would mean i skip out on the articles from 11-20. Adding another where clause to check on id is unsuccessful because articles may not always be inserted in the correct date order as the scraper is running concurrently.

Is there a way i can query this end point so i can always get consistent data from it with the latest articles coming first and then the older articles.

EDIT:

   +-----------------------+
   |   id | unix_timestamp |
   +-----------------------+
   |    1 |   1000         |
   |    2 |   1001         |
   |    3 |   1002         |
   |    4 |   1003         |
   |   11 |   1000         |
   |   12 |   1001         |
   |   13 |   1002         |
   |   14 |   1003         |
   +-----------------------+

The last timestamp and ID is being sent through the WHERE clause.

E.g. $this->db->where('unix_timestamp <=', $timestamp); $this->db->where('id <', $offset); $this->db->order_by('unix_timestamp ', 'DESC'); $this->db->order_by('id', 'DESC');

On querying with a timestamp of 1003, ids 14 and 4 are fetched. But then during the next call, id 4 would be the offset thereby not fetching id 13 and only fetching id 3 the next time around.So data would be missing .

Upvotes: 4

Views: 329

Answers (1)

Rick James
Rick James

Reputation: 142398

Two parts: timestamp and id.

WHERE   timestamp <= $ts_leftoff
  AND ( timestamp <  $ts_leftoff
            OR id <= $id_leftoff )
ORDER BY (timestamp DESC, id DESC)

So, assuming id is unique, it won't matter if lots of rows have the same timestamp, the order is fully deterministic.

There is a syntax for this, but unfortunately it is not well optimized:

WHERE (timestamp, id) <= ($ts_leftoff, $id_leftoff)

So, I advise against using it.

More on the concept of "left off": http://mysql.rjweb.org/doc.php/pagination

Upvotes: 2

Related Questions