user1554862
user1554862

Reputation: 23

Creating cumulative moving average on SQLite

I am trying to create a cumulative moving average in SQLite.

As a refresher, in a cumulative moving average (CMA), the data arrives in an ordered datum stream, and I would like to get the average of all of the data up until the current datum point.

My Table looks like:

Continent,Date,Measure,Value
Antarctica,03/01/2019 12:00:00 AM,Passengers,346158
South America,03/01/2019 12:00:00 AM,Ships,6483
South America,03/01/2019 12:00:00 AM,Flights,19
Antarctica,02/01/2019 12:00:00 AM,Passengers,172163
South America,02/01/2019 12:00:00 AM,Cargo Ships,1319
Antarctica,01/01/2019 12:00:00 AM,Passengers,56810

Previous solutions such as 1 or 2 describe a moving average monthly or weekly. However, while I can hold this average month to month, I am trying to build a cumulative average.

I tried doing this:

SELECT T1.Date, AVG(T2.VALUE) from my_table AS T1 INNER JOIN my_table AS T2 ON datetime(T1.Date, '-1 Month') <= datetime(T2.Date) 
  AND datetime(T1.Date, '+1 Month') >= datetime(T2.Date) GROUP BY
  T1.date;

But when I use sqlite, the datetime operation produces an error: sqlite does not have operation datetime.

I even tried the simple command: SELECT AVG(VALUE) FROM my_table GROUP BY MEASURE, DATE, CONTINENT, but that grouped by the moving average, which did not solve my issue.

What I am looking to do:

Continent,Date,Measure,Value,Average
Antarctica,03/01/2019 12:00:00 AM,Passengers,346158,114487
South America,03/01/2019 12:00:00 AM,Ships,6483,0
South America,03/01/2019 12:00:00 AM,Flights,19,0
Antarctica,02/01/2019 12:00:00 AM,Passengers,172163,56810
South America,02/01/2019 12:00:00 AM,Cargo Ships,1319,0
Antarctica,01/01/2019 12:00:00 AM,Passengers,56810,0

The column, Average, is for the running monthly average of total crossings to the Continent and means of crossing in all previous months. So, to calculate the Average for the first line (i.e., running monthly average of total Passengers crossing Antarctica in all of the months preceding March), you'd take the average sum of total number of Antarctica passenger crossings in February 156,891 + 15,272 = 172,163 and January 56,810, and round it to the nearest whole number round(228,973/2) = 114,487.

Is there an easier to way to go about this?

Upvotes: 1

Views: 313

Answers (1)

Shawn
Shawn

Reputation: 52579

First, fix your timestamps so they are in a format that can be sorted, like the ISO-8601 ones that sqlite date and time functions support. Instead of 03/01/2019 12:00:00 AM, use 2019-03-01 00:00:00 (Or just 2019-03-01 if you don't care about the time, just the date). This leaves your CSV data looking like:

Continent,Date,Measure,Value
Antarctica,2019-03-01 00:00:00,Passengers,346158
South America,2019-03-01 00:00:00,Ships,6483
South America,2019-03-01 00:00:00,Flights,19
Antarctica,2019-02-01 00:00:00,Passengers,172163
South America,2019-02-01 00:00:00,Cargo Ships,1319
Antarctica,2019-01-01 00:00:00,Passengers,56810

Then you can use a window function (Introduced in Sqlite 3.25) to easily calculate the cumulative average of previous months:

SELECT continent, date, measure, value,
       cast(round(ifnull(avg(value)
                          OVER (PARTITION BY continent, measure
                                ORDER BY date
                                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
                         0),
                  0) AS INTEGER) AS Average
FROM crossings
ORDER BY date DESC, continent, measure DESC;

which gives

Continent   Date                 Measure     Value       Average   
----------  -------------------  ----------  ----------  ----------
Antarctica  2019-03-01 00:00:00  Passengers  346158      114487 
South Amer  2019-03-01 00:00:00  Ships       6483        0    
South Amer  2019-03-01 00:00:00  Flights     19          0   
Antarctica  2019-02-01 00:00:00  Passengers  172163      56810   
South Amer  2019-02-01 00:00:00  Cargo Ship  1319        0     
Antarctica  2019-01-01 00:00:00  Passengers  56810       0     

If stuck on an older version without window function support, you can use a correlated subquery to calculate the cumulative average:

SELECT continent, date, measure, value,
       ifnull((SELECT cast(round(avg(c2.value), 0) AS INTEGER)
               FROM crossings AS c2
               WHERE c2.continent = c.continent
                 AND c2.measure = c.measure
                 AND c2.date < c.date),
              0) AS Average
FROM crossings AS c
ORDER BY date DESC, continent, measure DESC;

Both versions will benefit from an index on (continent, measure, date).

Upvotes: 2

Related Questions