Unable to translate BigQuery legacy SQL to standard SQL for HAVING LEFT(...)

Question

I would like to use BigQuery Standard SQL for a query like this one:

SELECT package, COUNT(*) count
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id
  FROM (
    SELECT SPLIT(content, '
') line, id
    FROM [github-groovy-files:github.contents] 
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6)='import' )
    GROUP BY package, id
  )
GROUP BY 1
ORDER BY count DESC
LIMIT 30;

I cannot get past something like this (works but doesn't GROUP or COUNT):

with lines as 
(SELECT SPLIT(c.content, '
') line, c.id as id
      FROM `.contents` c, `.files` f 
      WHERE c.id = f.id AND f.path LIKE '%.groovy')
select 
  array(select REGEXP_REPLACE(l, r'import |;', '') AS class from unnest(line) as l where l like 'import %') imports, id
from lines;

LEFT() is not in Standard SQL and there doesn't seem to be a function that will accept and array type.

Mikhail Berlyant · Accepted Answer

LEFT() is not in Standard SQL ...

In BigQuery Standard SQL you can use SUBSTR(value, position[, length]) instead of Legacy's LEFT

... and there doesn't seem to be a function that will accept and array type.

There are plenty of Array's related functions as well as functions that accept array as argument - for example UNNEST()

I would like to use BigQuery Standard SQL for a query like this one:

Below is equivalent query for BigQuery Standard SQL

SELECT package, COUNT(*) COUNT
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id
  FROM (
    SELECT line, id
    FROM `github-groovy-files.github.contents`,
    UNNEST(SPLIT(content, '
')) line
    WHERE SUBSTR(line, 1, 6)='import' 
  )
  GROUP BY package, id
)
GROUP BY 1
ORDER BY COUNT DESC
LIMIT 30

Instead of WHERE SUBSTR(line, 1, 6)='import' you can use WHERE line LIKE 'import%'

Also note, this query can be written in number of ways - so in my above example I focused on "translating" your query into from legacy to standard sql while preserving core structure and approach of original query

But if you woukld like to rewrite it using power of Standard SQL - you would ended up with something like below

SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, COUNT(DISTINCT id) count
FROM `github-groovy-files.github.contents`,
UNNEST(SPLIT(content, '
')) line
WHERE line LIKE 'import%' 
GROUP BY 1
ORDER BY count DESC
LIMIT 30

Unable to translate BigQuery legacy SQL to standard SQL for HAVING LEFT(...)

Answers (1)

Related Questions