Gordo
Gordo

Reputation: 23

Page Hits - Not Preceded by

I am trying to come up with a way to locate visits where a user went to a certain hits.page.hostname, but the previous hits.page.hostname is not a certain host. I was looking to use GROUP_CONCAT (like shown at 'Immediate Follow' Page Path in BigQuery) and regex with lookbehind, but I discovered that RE2 doesn't have lookbehind so I'm back to square one.

Has anyone got any ideas?

Upvotes: 1

Views: 184

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Below is attempt to "save" your original idea of using GROUP_CONCAT with later REGEX_MATCH

Try below

#legacySQL
SELECT 
  date, 
  totals.visits AS totalvisits, 
  GROUP_CONCAT(hits.page.hostname, ">>") WITHIN RECORD AS sequence 
FROM TABLE_DATE_RANGE([MySys:110909631.ga_sessions_],
     DATE_ADD(CURRENT_TIMESTAMP(), -90, 'DAY'), 
     DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')) 
HAVING REGEXP_MATCH(CONCAT('>>', sequence, '>>'), '>>(DesiredHost>>)+')
AND NOT REGEXP_MATCH(CONCAT('>>', sequence, '>>'), '>>(RejectHost>>)+(DesiredHost>>)+')  

Above looks for visits with DesiredHost in hits.page.hostname

REGEXP_MATCH(CONCAT('>>', sequence, '>>'), '>>(DesiredHost>>)+')  

but excludes those visits where DesiredHost is preceeded with RejectHost

REGEXP_MATCH(CONCAT('>>', sequence, '>>'), '>>(RejectHost>>)+(DesiredHost>>)+')  

As you can see in both expressions above - you can fine tune length of visit within host or set any other more complex logic to look for. Hope you get an idea.

And hope you can get your initial idea back to work!

Upvotes: 0

Elliott Brossard
Elliott Brossard

Reputation: 33755

I think you'd have an easier time using standard SQL. You can look "backwards" through an array by using UNNEST WITH OFFSET and then the array operator. For example,

#standardSQL
SELECT
  date,
  totals.visits AS total_visits
FROM
  `MyProject.dataset.ga_sessions_*`
WHERE
  PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) BETWEEN
    DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) AND
    DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND
EXISTS (
  SELECT 1 FROM UNNEST(hits) AS hit WITH OFFSET off
  WHERE hit.page.hostname = 'foo' AND
    NOT hits[SAFE_OFFSET(off - 1)].page.hostname = 'bar'
);

Upvotes: 1

Related Questions