Pak Hang Leung
Pak Hang Leung

Reputation: 389

Group rows with same name only under certain conditions with SQL in redshift

Now I have a table with webpage data look like this:

 Visitor_ID  Visit_ID  visit_time  pagepath         page seq    timestart        timeend
   0001       0111      2018-11-19  /homepage          1         ...              ...
   0001       0111      2018-11-19  /someotherpage     2         ...              ...
   0001       0111      2018-11-19  /product           3     2018-11-19 23:05  2018-11-19 23:15
   0001       0111      2018-11-19  /product           4     2018-11-19 23:15  2018-11-19 23:16
   0001       0111      2018-11-19  /product           5     2018-11-19 23:16  2018-11-19 23:17
   0001       0111      2018-11-19  /someotherpage     6         ...              ...   
   0001       0111      2018-11-19  /someotherpage     7         ...              ...  
   0001       0111      2018-11-19  /product           8     2018-11-19 23:25  2018-11-19 23:26

There is an error in the raw data, that some of the pages are duplicated, and what I want is to group the duplication, which is in sequence like the pages sequence 3,4,5 and sum up the time browse like:

 Visitor_ID  Visit_ID  visit_time  pagepath   page seq     timestart        timeend
   0001       0111      2018-11-19  /product      3     2018-11-19 23:05  2018-11-19 23:17
   ...
   ...
   0001       0111      2018-11-19  /product      8     2018-11-19 23:25  2018-11-19 23:26

I used the group by and take the minumum. However the problem is, in this case, page 8 is a vaild visit and it will also be grouped, which is not the desired result

I also tried to use lag and partition solution, but the problem is

In overall is, I want to group the record that with the same pagepath which is on sequence and sum them up, but at the same time keep the same pagepath which is not a duplicated record. Does anyone have any idea how to do so in redshift?

Many thanks for your help in advance

Upvotes: 1

Views: 61

Answers (1)

MatBailie
MatBailie

Reputation: 86706

Assuming that in your post ... means "and here is something different" (as opposed to the actual meaning of "here is more of the same") then you appear to want a "gaps-and-islands" solution.

I'll post the answer, you see if it works. But the explanation is long-winded and you'll need to search SO for other gaps-and-islands answers to find various explanations.

SELECT
  Visitor_ID,
  Visit_ID,
  MIN(visit_time)  AS visit_time,
  pagepath,
  MIN(page_seq)    AS page_seq,
  MIN(timestart)   AS timestart,
  MAX(timeend)     AS timeend
FROM
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY Visitor_ID, Visit_ID, pagepath
                           ORDER BY page_seq
                      )
                         AS pagepath_seq
  FROM
    yourTable
)
  AS sequenced
GROUP BY
  Visitor_ID,
  Visit_ID,
  pagepath,
  pagepath_seq - page_seq

(Makes the assumption that page_seq starts at 1, never has any gaps, and "restarts" for each Visitor_ID/Visit_ID.)

Example : https://dbfiddle.uk/?rdbms=postgres_8.4&fiddle=1d47a59b33bbfae11ba3040f392ec9c5

Upvotes: 1

Related Questions