Frank

Reputation: 1206

Can I speed up calculations between R and SQLite by using data.tables?

I have an SQLite database with about 1.4 million rows and 16 columns.

I have to run an operation on 80,000 ids:

  1. Get all rows associated with that id
  2. Convert to R Date objects and sort by date
  3. Calculate the difference between the 2 most recent dates

For each id I have been querying SQLite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire SQLite table into a data.table?
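Roughly, my current approach looks like the sketch below (table, column, and file names are simplified placeholders):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "mydata.sqlite")   # placeholder file name
ids <- dbGetQuery(con, "SELECT DISTINCT id FROM data")$id   # the 80,000 ids

days_diff <- vapply(ids, function(i) {
  res <- dbSendQuery(con, "SELECT timestamp FROM data WHERE id = ?")  # step 1
  dbBind(res, list(i))
  rows <- dbFetch(res)
  dbClearResult(res)
  d <- sort(as.Date(rows$timestamp), decreasing = TRUE)               # step 2
  as.numeric(d[1] - d[2])                                             # step 3
}, numeric(1))

dbDisconnect(con)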

Upvotes: 0

Views: 663

Answers (2)

Shawn

Reputation: 52579

If you're doing all that in R and fetching rows from the database 80,000 times in a loop... you'll probably have better results doing it all in one go in SQLite instead.

Given a skeleton table like:

CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
                        (2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);

a query like:

SELECT id
     , julianday(first_ts)
       - julianday((SELECT max(d2.timestamp)
                    FROM data AS d2
                    WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) as first_ts FROM data GROUP BY id) AS d
ORDER BY id;

will give you

id          days_difference
----------  ---------------
1           6.0
2           1.0

An alternative for modern versions of SQLite (3.25 or newer) (EDIT: on a test database with 16 million rows and 80,000 distinct ids, it runs considerably slower than the one above, so you don't want to actually use it):

WITH cte AS
 (SELECT id, timestamp
       , lead(timestamp, 1) OVER id_by_ts AS next_ts
       , row_number() OVER id_by_ts AS rn
  FROM data
  WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;

(The index is essential for performance in both versions. You'll probably also want to run ANALYZE on the table at some point after it's populated and your index(es) are created.)
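From R, a single round trip for the first query could look something like this with DBI/RSQLite (the database file name is just a placeholder):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")  # placeholder file name
dbExecute(con, "ANALYZE")                             # refresh the query planner's statistics
result <- dbGetQuery(con, "
  SELECT id
       , julianday(first_ts)
         - julianday((SELECT max(d2.timestamp)
                      FROM data AS d2
                      WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
  FROM (SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id) AS d
  ORDER BY id")
dbDisconnect(con)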

Upvotes: 1

DSGym

Reputation: 2867

It depends heavily on how you approach the problem.

Normally, loading the whole query result into memory and then doing the operations there will be faster, in my experience and from benchmarks I have seen, although I can't show you one right now. It also makes sense logically, because otherwise you repeat the same operations many times on many small data.frames. As the plot below suggests, one 80k-row query is pretty fast, faster than three queries of ~26xxx rows each.

[benchmark plot: one 80k-row query vs. three ~26xxx-row queries]
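For example, if you pull the whole table into memory once, steps 2 and 3 from the question can be done in a single data.table pass, something like this (table, column, and file names are assumptions):

library(DBI)
library(data.table)

con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")   # assumed file name
dt <- setDT(dbGetQuery(con, "SELECT id, timestamp FROM data"))
dbDisconnect(con)

dt[, timestamp := as.Date(timestamp)]
setorder(dt, id, -timestamp)                           # most recent date first within each id
result <- dt[, .(days_difference = as.numeric(timestamp[1] - timestamp[2])), by = id]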

However, you could also have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, each subset on its own core. See the sketch after the link below.

Here you can find information on how to do this: http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
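A rough sketch of that idea (assuming numeric ids and a placeholder file name; each worker opens its own connection because SQLite connections can't be shared across forked processes, and mclapply falls back to serial on Windows, where you would use parLapply instead):

library(parallel)
library(DBI)

db_path <- "mydata.sqlite"                              # placeholder file name

con <- dbConnect(RSQLite::SQLite(), db_path)
ids <- dbGetQuery(con, "SELECT DISTINCT id FROM data")$id
dbDisconnect(con)

# split the ids into one chunk per core
chunks <- split(ids, cut(seq_along(ids), detectCores(), labels = FALSE))

results <- mclapply(chunks, function(chunk) {
  con <- dbConnect(RSQLite::SQLite(), db_path)          # one connection per worker
  on.exit(dbDisconnect(con))
  rows <- dbGetQuery(con, paste0(
    "SELECT id, timestamp FROM data WHERE id IN (",
    paste(chunk, collapse = ","), ")"))                 # assumes numeric ids
  rows$timestamp <- as.Date(rows$timestamp)
  # difference between the two most recent dates, per id
  sapply(split(rows$timestamp, rows$id), function(d) {
    d <- sort(d, decreasing = TRUE)
    as.numeric(d[1] - d[2])
  })
}, mc.cores = detectCores())

days_difference <- unlist(results)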

Upvotes: 1
