AnaHid
AnaHid

Reputation: 37

Writing a String comparison function in bigquery

I am trying to write a function with bigquery UDF to compare a list of string with other list of strings. basically I would like to know how many new users do we have per week and from these new Users how many of them kept visiting our website in future weeks. For that I created a query which gives me a String of all emails per week (with group_concat) and saved it as a table. now need to know how can I compare each with Other collections of emails per week. At the end, I would like to have a table like this :

+----------------+-------+-------+--------+------+
|       | week 1 | week 2 | week 3| week 4 | ... |
+----------------+-------+-------+--------+------+
| week1 |   17   |    7   |   5   |   9    | ... |
+----------------+-------+-------+--------+------+
| week2 |        |   19   |  13   |   8    | ... |  
+-----------------+-------+-------+--------+-----+
| week3 |        |        |  24   |   15   | ... |
+-----------------+-------+-------+--------+-----+

Upvotes: 0

Views: 1472

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Just to give you an idea to play with

SELECT 
  CONCAT('week', STRING(prev)) AS WEEK,
  SUM(IF(next=19, authors, 0)) AS week19,
  SUM(IF(next=20, authors, 0)) AS week20,
  SUM(IF(next=21, authors, 0)) AS week21,
  SUM(IF(next=22, authors, 0)) AS week22,
  SUM(IF(next=23, authors, 0)) AS week23
FROM (
  SELECT prev, next, COUNT(author) AS authors
  FROM (
    SELECT
      prev_week.week_created AS prev,
      next_week.week_created AS next,
      prev_week.author AS author
    FROM (
      SELECT  
        WEEK(SEC_TO_TIMESTAMP(created_utc)) AS week_created,
        author
      FROM [fh-bigquery:reddit_posts.2016_05] 
      GROUP BY 1,2
    ) next_week
    LEFT JOIN (
      SELECT  
        WEEK(SEC_TO_TIMESTAMP(created_utc)) AS week_created,
        author
      FROM [fh-bigquery:reddit_posts.2016_05] 
      GROUP BY 1,2
    ) AS prev_week
    ON prev_week.author = next_week.author
    HAVING prev <= next
  )
  GROUP BY 1,2
)
GROUP BY 1
ORDER BY 1

Result is as below
enter image description here

This is the closest to what you asked i can think of

Meantime, please note - BigQuery is less tailored for reports design rather for data crunching. So I think that creating matrix/pivot within BigQuery (outer select) is not the best fit - it can be done in your reporting tool. But calculating all pairs prev|next|count (inner select) is definitely suitable here in BigQuery

Upvotes: 2

Related Questions