Reputation: 389
Now I have a table with webpage data look like this:
Visitor_ID Visit_ID visit_time pagepath page seq timestart timeend
0001 0111 2018-11-19 /homepage 1 ... ...
0001 0111 2018-11-19 /someotherpage 2 ... ...
0001 0111 2018-11-19 /product 3 2018-11-19 23:05 2018-11-19 23:15
0001 0111 2018-11-19 /product 4 2018-11-19 23:15 2018-11-19 23:16
0001 0111 2018-11-19 /product 5 2018-11-19 23:16 2018-11-19 23:17
0001 0111 2018-11-19 /someotherpage 6 ... ...
0001 0111 2018-11-19 /someotherpage 7 ... ...
0001 0111 2018-11-19 /product 8 2018-11-19 23:25 2018-11-19 23:26
There is an error in the raw data, that some of the pages are duplicated, and what I want is to group the duplication, which is in sequence like the pages sequence 3,4,5 and sum up the time browse like:
Visitor_ID Visit_ID visit_time pagepath page seq timestart timeend
0001 0111 2018-11-19 /product 3 2018-11-19 23:05 2018-11-19 23:17
...
...
0001 0111 2018-11-19 /product 8 2018-11-19 23:25 2018-11-19 23:26
I used the group by and take the minumum. However the problem is, in this case, page 8 is a vaild visit and it will also be grouped, which is not the desired result
I also tried to use lag and partition solution, but the problem is
I can just get the row after
And the number of replication is different between record, so cannot solve it by fixed code
In overall is, I want to group the record that with the same pagepath which is on sequence and sum them up, but at the same time keep the same pagepath which is not a duplicated record. Does anyone have any idea how to do so in redshift?
Many thanks for your help in advance
Upvotes: 1
Views: 61
Reputation: 86706
Assuming that in your post ...
means "and here is something different" (as opposed to the actual meaning of "here is more of the same") then you appear to want a "gaps-and-islands" solution.
I'll post the answer, you see if it works. But the explanation is long-winded and you'll need to search SO for other gaps-and-islands
answers to find various explanations.
SELECT
Visitor_ID,
Visit_ID,
MIN(visit_time) AS visit_time,
pagepath,
MIN(page_seq) AS page_seq,
MIN(timestart) AS timestart,
MAX(timeend) AS timeend
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY Visitor_ID, Visit_ID, pagepath
ORDER BY page_seq
)
AS pagepath_seq
FROM
yourTable
)
AS sequenced
GROUP BY
Visitor_ID,
Visit_ID,
pagepath,
pagepath_seq - page_seq
(Makes the assumption that page_seq
starts at 1, never has any gaps, and "restarts" for each Visitor_ID/Visit_ID.)
Example : https://dbfiddle.uk/?rdbms=postgres_8.4&fiddle=1d47a59b33bbfae11ba3040f392ec9c5
Upvotes: 1