DevDavid
DevDavid

Reputation: 325

How to get yes/no statistics from SQL of how often strings occur each

Is there a way to query a table from BigQuery project HTTPArchive by checking how often certain strings occur by a certain file type?

I was able to write a query for a single check but how to perform this query on multiple strings at once without needing to send the same query every time just with a different string check and process the ~800GB of table data every time?

Getting the results as array might work somehow? I want to publish in-depth monthly statistics to the public for free so the option to send those queries separately and get billed for querying of roughly $2000/month is no option for me as a student.

SELECT matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (
    SELECT url, (LOWER(body) CONTAINS 'document.write') AS matched
    FROM httparchive.har.2017_09_01_chrome_requests_bodies
    WHERE url LIKE "%.js"
)
GROUP BY matched

Please note that this is just one example of many (~50) and the pre-generated stats are not what I am looking for as it doesn't contain the needed information.

Upvotes: 0

Views: 169

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173121

Below is for BigQuery Standard SQL

#standardSQL
WITH strings AS (
  SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
  SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext
)
SELECT 
  ext, str, COUNT(1) total, 
  COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches, 
  ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s 
GROUP BY ext, str
-- ORDER BY ext, str

You can test / play with above using [totally] dummy data as below

#standardSQL
WITH `httparchive.har.2017_09_01_chrome_requests_bodies` AS (
  SELECT '1234.js' AS url, 'abc=1;x=2' AS body UNION ALL  
  SELECT 'qaz.js', 'y=1;xyz=0' UNION ALL
  SELECT 'edc.go', 's=1;xyz=2;abc=3' UNION ALL
  SELECT 'edc.go', 's=1;xyz=4;abc=5' UNION ALL
  SELECT 'rfv.php', 'd=1' UNION ALL
  SELECT 'tgb.txt', '?abc=xyz' UNION ALL
  SELECT 'yhn.php', 'like v' UNION ALL
  SELECT 'ujm.go', 'lkjsad' UNION ALL
  SELECT 'ujm.go', 'yhj' UNION ALL
  SELECT 'ujm.go', 'dfgh' UNION ALL
  SELECT 'ikl.js', 'werwer'   
), strings AS (
  SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str
), files AS (
  SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext
)
SELECT 
  ext, str, COUNT(1) total, 
  COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches, 
  ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) / COUNT(1), 3) ratio
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext)
CROSS JOIN strings s 
GROUP BY ext, str
ORDER BY ext, str

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270713

One method is to bring in a table with the different strings. This is the idea:

SELECT str, matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio
FROM (SELECT crb.url, s.str, (LOWER(crb.body) CONTAINS s.str) AS matched
      FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN
           (SELECT 'document.write' as str UNION ALL
            SELECT 'xxx' as str
           ) s
      WHERE url LIKE "%.js"
     )
GROUP BY str, matched;

You would just add more strings to s.

Upvotes: 1

Related Questions