Reputation: 21
I am new to SQL and I am communicating with a PostgreSQL database via queries. I have the following question: My (simplified) data table
looks like this.
DROP TABLE table;
CREATE TABLE table(
id INTEGER NOT NULL PRIMARY KEY
,date DATE NOT NULL
,key BIT NOT NULL
);
INSERT INTO table(id,date,key) VALUES (1,'18/02/05',0);
INSERT INTO table(id,date,key) VALUES (1,'20/02/05',1);
INSERT INTO table(id,date,key) VALUES (1,'21/02/05',0);
INSERT INTO table(id,date,key) VALUES (1,'10/04/06',0);
INSERT INTO table(id,date,key) VALUES (2,'09/05/08',0);
INSERT INTO table(id,date,key) VALUES (2,'17/06/08',1);
INSERT INTO table(id,date,key) VALUES (2,'22/06/08',1);
INSERT INTO table(id,date,key) VALUES (2,'23/06/08',1);
+----+------------+-----+ | id | date | key | +----+------------+-----+ | 1 | 2005-02-18 | 0 | | 1 | 2005-02-20 | 1 | | 1 | 2005-02-21 | 0 | | 1 | 2006-04-10 | 0 | | 2 | 2008-05-09 | 0 | | 2 | 2008-06-17 | 1 | | 2 | 2008-06-22 | 1 | | 2 | 2008-06-23 | 1 | +----+------------+-----+
Where id
identifies different groups in my data, date
(formatted as date column) indicates the date a particular event occurred and key
identifies important events in my data set.
Now, I need to conduct the following tasks for each group of observations.
A) Count the number of past key events in a particular time window for each date entry (let's say 7 days for the moment), in other words: for every date entry: How many times did a key event occur in the last 7 days (count key=1 for date-7 days) Comment: this is how it looks like in stata
B) Calculate the time difference in days between each event and the most recent key event, (date - last(date where key=1) =x. (ANSWERED, check out Gordon's post) The final result should look like this:
+----+------------+-----+--------+-----------+ | id | date | key | number | time_diff | +----+------------+-----+--------+-----------+ | 1 | 2005-02-18 | 0 | 0 | NA | | 1 | 2005-02-20 | 1 | 0 | 0 | | 1 | 2005-02-21 | 0 | 1 | 1 | | 1 | 2006-04-10 | 0 | 0 | 413 | | 2 | 2008-05-09 | 0 | 0 | NA | | 2 | 2008-06-17 | 1 | 0 | 0 | | 2 | 2008-06-22 | 1 | 1 | 5 | | 2 | 2008-06-23 | 1 | 2 | 1 | +----+------------+-----+--------+-----------+
All events that occurred before the first key event in a particular group should be tagged as NULL or NA.
I tried to solve B with help of this blog but I am using Postgresql 9.3. and the FILTER clause is a feature of v9.4 if I am not mistaken.
My idea was to try the following:
dataset <- dbGetQuery(channel, "SELECT t1.*, t1.date -
(
SELECT MIN(t2.date)
FROM table t2
WHERE t1.id = t2.id AND t2.key==1
AND t1.date-t2.date <= 7 AND t1.date-t2.date >= 0
) AS time_diff FROM table t1 ORDER BY t1.id, t1.date")
But the result was not very satisfying, if there is more than one key event in my time window. I assume that I need to use a window function, specify my key events as FIRST_VALUE or set some kind of time interval but I have no clear idea how to implement this in order to reach the desired result. As you can see, I am using R to send the query to the DB.
Any help is appreciated. In case you need additional information, please let me know, especially since this is my first question on SO.
Upvotes: 2
Views: 825
Reputation: 1269445
Your question seems to be around "B" rather than "A".
You can do "B" with window functions, but it involves a conditional forward max scan on the date, rather than lag()
:
select t.*,
(date -
max(case when key = 1 then date end) over (partition by id order by date)
) as time_diff
from t;
Upvotes: 1