Felipe Hoffa
Felipe Hoffa

Reputation: 59175

Improve BigQuery case insensitive search performance

The BigQuery team strikes again: This question is not longer relevant, as the results with LOWER() are as fast as with REGEX() now.


Processing ~5GB of data with BigQuery should be super fast. For example the following query performs a case insensitive search in 18 seconds:

#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full` 
WHERE 
  LOWER(text) LIKE '%bigquery%' # 18s

Usually BigQuery is faster than this, but the real problem is that adding new search terms makes this query considerably slower (almost a minute with 3 search terms):

#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full` 
WHERE 
  LOWER(text) LIKE '%bigquery%' OR LOWER(text) LIKE '%big query%' # 34s

#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full` 
WHERE 
  LOWER(text) LIKE '%bigquery%' OR LOWER(text) LIKE '%big query%'
  OR LOWER(text) LIKE '%google cloud%' # 52s

How can I improve my query performance?

Upvotes: 7

Views: 4384

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

Note from the team: Stay tuned! Very soon BigQuery will turn this advice irrelevant.

BigQuery performance tip: Avoid using LOWER() and UPPER()

LOWER() and UPPER() operations have a hard time when dealing with Unicode text: each character needs to be mapped individually and they can also be multi-bytes.

Solution 1: Case insensitive regex

A faster alternative: Use REGEX_MATCH() and add the case insensitive (?i) modifier to your regular expression

#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full` 
WHERE 
   REGEXP_CONTAINS(text, '(?i)bigquery') # 7s

#  REGEXP_CONTAINS(text, '(?i)bigquery')
#   OR REGEXP_CONTAINS(text, '(?i)big query') # 9s

#  REGEXP_CONTAINS(text, '(?i)bigquery') 
#   OR REGEXP_CONTAINS(text, '(?i)big query') 
#   OR REGEXP_CONTAINS(text, '(?i)google cloud') # 11s

Performance is much better this way:

  • 1 search term: 18s down to 8s
  • 2 search terms: 34s down to 9s
  • 3 search terms: 52s down to 11s.

Solution 2: Combine regexes

Why do 3 searches when a regular expression can combine many into 1?

#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full` 
WHERE 
  REGEXP_CONTAINS(text, '(?i)(bigquery|big query|google cloud)') # 7s

3 terms in 7s - nice.

Solution 3: Transform to bytes

This is uglier, but shows that UPPER() and LOWER() perform way better when dealing with individual bytes - for equivalent results in these searches:

#standardSQL
SELECT COUNT(*) c
FROM `bigquery-public-data.hacker_news.full` 
WHERE 
  LOWER(CAST(text AS BYTES)) LIKE b'%bigquery%'
  OR LOWER(CAST(text AS BYTES)) LIKE b'%big query%' 
  OR LOWER(CAST(text AS BYTES)) LIKE b'%google cloud%' # 7s

LOWER() is slower. Use the regex (?i) modifier instead.

If this worked for you, please feel free to comment with your performance improvements.

Upvotes: 16

Related Questions