Reputation: 59175
The BigQuery team strikes again: This question is not longer relevant, as the results with LOWER() are as fast as with REGEX() now.
Processing ~5GB of data with BigQuery should be super fast. For example the following query performs a case insensitive search in 18 seconds:
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(text) LIKE '%bigquery%' # 18s
Usually BigQuery is faster than this, but the real problem is that adding new search terms makes this query considerably slower (almost a minute with 3 search terms):
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(text) LIKE '%bigquery%' OR LOWER(text) LIKE '%big query%' # 34s
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(text) LIKE '%bigquery%' OR LOWER(text) LIKE '%big query%'
OR LOWER(text) LIKE '%google cloud%' # 52s
How can I improve my query performance?
Upvotes: 7
Views: 4384
Reputation: 59175
Note from the team: Stay tuned! Very soon BigQuery will turn this advice irrelevant.
LOWER()
and UPPER()
LOWER()
and UPPER()
operations have a hard time when dealing with Unicode text: each character needs to be mapped individually and they can also be multi-bytes.
A faster alternative: Use REGEX_MATCH()
and add the case insensitive (?i)
modifier to your regular expression
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
REGEXP_CONTAINS(text, '(?i)bigquery') # 7s
# REGEXP_CONTAINS(text, '(?i)bigquery')
# OR REGEXP_CONTAINS(text, '(?i)big query') # 9s
# REGEXP_CONTAINS(text, '(?i)bigquery')
# OR REGEXP_CONTAINS(text, '(?i)big query')
# OR REGEXP_CONTAINS(text, '(?i)google cloud') # 11s
Performance is much better this way:
Why do 3 searches when a regular expression can combine many into 1?
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)') # 7s
3 terms in 7s - nice.
This is uglier, but shows that UPPER()
and LOWER()
perform way better when dealing with individual bytes - for equivalent results in these searches:
#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE
LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'
OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%'
OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%' # 7s
If this worked for you, please feel free to comment with your performance improvements.
Upvotes: 16