Pat396

Reputation: 39

Why is my SQL statement so slow (sqlite 3)?

So I'm using python sqlite3 to connect to a locally saved database on my laptop and query tables. The tables are fairly large and have up to 5 million rows (minutely forex exchange rates).

I have 3 relevant tables in the database ("EURUSD", "BTCUSD", "ETHUSD"), each with the columns "Timestamp" and "Open", which corresponds to the rate at a specific minute (the format is a Unix timestamp).

I have another table in Python (a dataframe) which looks like this: [image: table to be filled with values]

What I'm trying to do is the following: for every row in the table to be filled, check the currency, search for the NEAREST timestamp value in the corresponding database table, and return the matching "Open". So if the currency in a row is "EUR", go to the table "EURUSD", select the nearest timestamp and return the open rate. If the currency is "BTC", go to the table "BTCUSD", select the nearest timestamp and return the open.

It has to be the nearest timestamp since the exact timestamp may not be in the database.

My current code looks like this, but I'm running it hundreds of times (once per row), and this makes it slow.

import sqlite3

conn = sqlite3.connect(path)
cur = conn.cursor()
for idx in dataframe.index:
    timestamp = dataframe.loc[idx, "Timestamp"]
    cur.execute(
        "SELECT TimestampUnix, Open FROM EURUSD"
        " WHERE EURUSD.TimestampUnix < " + str(timestamp) +
        " ORDER BY EURUSD.TimestampUnix DESC LIMIT 1"
    )
    query = cur.fetchall()
    rate = query[0][1]
    dataframe.loc[idx, "ForexRate"] = rate

I indexed the table on "TimestampUnix", which dropped the runtime for one query from 5 s down to 1 s. But that still seems too long for a locally executed query.
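(For reference, the index was created with something along these lines; the index name is just an example:)

cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_eurusd_ts ON EURUSD (TimestampUnix)"
)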

My thought was that the SQL query could be the problem since it selects nearly the whole table in some cases and has to order it. Do you have any other ideas to make it faster?

EDIT: I edited the description of what I want to do. I think (as some of you already mentioned) my solution is completely inefficient and maybe this could be solved with a single query. The query should essentially do two things: pick the correct table based on the currency, and find the nearest timestamp in that table and return its Open value.

Since I'm not proficient in SQL, I'm having trouble with this and would appreciate the help.

Upvotes: 0

Views: 757

Answers (1)

Jon Armstrong

Reputation: 4694

Here's an updated response based on new detail in the question. The application-side table can be inserted into the database before performing the following, or the data could, if necessary, be injected via a CTE term.
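For example, if the application data is sitting in a pandas DataFrame, it could be appended to the target table (created further below) before running the SQL; a minimal sketch, assuming the DataFrame columns already match the target table and the database file name is just illustrative:

import sqlite3
import pandas as pd

conn = sqlite3.connect("forex.db")   # path to the question's database (assumed name)

# Hypothetical application rows to be enriched with Open/rate values
app_rows = pd.DataFrame({
    "TimestampUnix": [1630942377000, 1625942399000],
    "currency":      ["EUR", "EUR"],
    "rate":          [None, None],
})

# Push the application rows into the database so the SQL below can see them
app_rows.to_sql("target", conn, if_exists="append", index=False)
conn.commit()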

The test case:

Working test case for sqlite

The SQL of interest: first, just the SELECT, which generates a result the application can use to update its local data; then an example UPDATE against a target table created in the database, holding the application data to be populated with new Open/rate values.

The literal 3000000000 is just the timestamp delta / range over which we search for Open detail, looking back from each application-provided timestamp of interest. Basically, we find the nearest prior timestamp in the EURUSD table.

-- Just calculate the result, without touching target
WITH RECURSIVE ts (ts, n) AS (
         SELECT TimestampUnix, ROW_NUMBER() OVER (ORDER BY TimestampUnix)
           FROM target WHERE currency = 'EUR'
     )
   , xrows (ts, n, TimestampUnix, Open, rn) AS (
         SELECT ts, n, EURUSD.TimestampUnix, EURUSD.Open
              , ROW_NUMBER() OVER (PARTITION BY n ORDER BY TimestampUnix DESC)
           FROM ts
           JOIN EURUSD
             ON EURUSD.TimestampUnix <=  ts.ts
            AND EURUSD.TimestampUnix >   ts.ts - 3000000000
     )
SELECT *
  FROM xrows WHERE rn = 1
;
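If the application prefers to keep the update on its side, it could run just this SELECT and map the Open values back onto its local rows; a rough sketch, where select_sql is assumed to hold the query above and the dataframe from the question supplied the target rows:

cur = conn.cursor()
cur.execute(select_sql)  # the SELECT shown above, kept in a Python string

# xrows columns are (ts, n, TimestampUnix, Open, rn); build ts -> Open
open_by_ts = {row[0]: row[3] for row in cur.fetchall()}

# Copy the looked-up Open values back into the local DataFrame
for idx in dataframe.index:
    ts = dataframe.loc[idx, "Timestamp"]
    if ts in open_by_ts:
        dataframe.loc[idx, "ForexRate"] = open_by_ts[ts]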
-- Now use the above to UPDATE the target table.
-- sqlite might not support a JOIN here.
-- So we use correlated behavior in the SET clause

UPDATE target
   SET rate = (
            WITH ts (ts, n, currency) AS (
                     SELECT TimestampUnix, ROW_NUMBER() OVER (ORDER BY TimestampUnix)
                          , currency
                       FROM target WHERE currency = 'EUR'
                 )
               , xrows (ts, n, currency, TimestampUnix, Open, rn) AS (
                     SELECT ts, n, currency, EURUSD.TimestampUnix, EURUSD.Open
                          , ROW_NUMBER() OVER (PARTITION BY n ORDER BY TimestampUnix DESC)
                       FROM ts
                       JOIN EURUSD
                         ON EURUSD.TimestampUnix <=  ts.ts
                        AND EURUSD.TimestampUnix >   ts.ts - 3000000000
                 )
            SELECT Open
              FROM xrows
             WHERE rn = 1
               AND ts       = target.TimestampUnix
               AND currency = target.currency
       )
 WHERE currency = 'EUR'
;
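From Python this becomes one statement per currency instead of one per row; a sketch, where update_sql_eur is assumed to hold the UPDATE above (analogous statements would reference BTCUSD/'BTC' and ETHUSD/'ETH'):

# One round trip per currency rather than one per row
conn.execute(update_sql_eur)
conn.commit()

# Read the populated rows back into the application if needed
rows = conn.execute("SELECT TimestampUnix, currency, rate FROM target").fetchall()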

The eurusd table and some generated sample data:

CREATE TABLE eurusd (
     TimestampUnix  timestamp, Open  int
);

-- Add some sample data for testing...
INSERT INTO eurusd
WITH RECURSIVE cte (ts, Open, n) AS (
        SELECT 1630942385000, 1101, 1 UNION ALL
        SELECT ts-1000000000, Open-1, n+1 FROM cte
         WHERE n < 15
     )
SELECT ts, Open FROM cte
;

The data generated looks like this:

[image: the generated eurusd sample rows]

The target / application table:

-- Let this table represent the data in the application to be updated.
CREATE TABLE target (
     TimestampUnix  timestamp
   , currency       varchar(10)
   , rate           int
);

-- The application could insert these rows before processing, or just provide this
-- data via a CTE term.
INSERT INTO target VALUES
    (1630942385000, 'TST',    0)
  , (1630942377000, 'EUR', NULL)
  , (1630942162000, 'BTC', NULL)
  , (1625942399000, 'EUR', NULL)
;

The table looks like this; we expect to update the highlighted rows:

[image: the target table before the update]

The result of the first query, which just shows the generated rows with the closest rate found:

[image: result rows from the SELECT]

The result of the target table after the UPDATE, with updated rates highlighted:

[image: the target table after the UPDATE]

Original response: Here's one way to calculate the timestamps of interest (every 12 hours, going back from the current_timestamp) and then return 1 row per timestamp, based on the logic in the question.

We start with the current_timestamp - 12 hours and iteratively (via recursion) produce successive timestamp values from which to calculate the next result, until some stop condition. I've added a condition to limit the number of iterations to about 10, but there is data only for a couple of days, which is 4 x 12 hour ranges.

ROW_NUMBER is used to calculate the position of each EURUSD row within each timestamp range (descending), so that rn = 1 represents the most recent row in the range, similar to your ORDER BY TimestampUnix DESC LIMIT 1.

I also adjusted your logic to avoid searching all data prior to the given timestamp. If the table contains months or years of data, it's likely you only need to review the last day or so to get the most recent Open value per timestamp. If that is acceptable, this could help your current SQL too, as sketched below.
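Applied to the query in the question, that just means adding a lower bound on the timestamp; a sketch, where the one-day window is only an assumption to tune:

ONE_DAY_MS = 24 * 60 * 60 * 1000   # assumes millisecond Unix timestamps

cur.execute(
    "SELECT TimestampUnix, Open FROM EURUSD"
    " WHERE TimestampUnix < ?"
    "   AND TimestampUnix >= ?"      # lower bound: skip everything older than a day
    " ORDER BY TimestampUnix DESC LIMIT 1",
    (timestamp, timestamp - ONE_DAY_MS),
)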

It's also likely you'll get a more significant improvement by simply not iterating over your SQL hundreds of times.

Updated with a slightly better form:

WITH RECURSIVE ts (ts, n) AS (
         SELECT datetime(current_timestamp, '-12 hour'), 1
          UNION ALL
         SELECT datetime(ts, '-12 hour'), n+1
           FROM ts
          WHERE n < 10
     )
   , xrows (ts, n, TimestampUnix, Open, rn) AS (
         SELECT ts, n, EURUSD.TimestampUnix, EURUSD.Open
              , ROW_NUMBER() OVER (PARTITION BY n ORDER BY TimestampUnix DESC)
           FROM ts
           JOIN EURUSD
             ON EURUSD.TimestampUnix <  datetime(ts.ts, '+12 hour')
            AND EURUSD.TimestampUnix >= ts.ts
     )
SELECT *
  FROM xrows WHERE rn = 1
;

This is just an example. The following fiddle contains some data for testing:

Working Test Case with data

Data used:

[image: the test data]

Result based on that data:

[image: the query result]

Upvotes: 1
