Reputation: 7693

Equivalent of string contains in google bigquery

I have a table like as shown below

I would like to create two new binary columns indicating whether the subject had steroids and aspirin. I am looking to implement this in Postgresql and google bigquery

I tried the below but it doesn't work

select subject_id
case when lower(drug) like ('%cortisol%','%cortisone%','%dexamethasone%') 
then 1 else 0 end as steroids,
case when lower(drug) like ('%peptide%','%paracetamol%') 
then 1 else 0 end as aspirin,
from db.Team01.Table_1


SELECT 
db.Team01.Table_1.drug
FROM `table_1`,
UNNEST(table_1.drug) drug
WHERE REGEXP_CONTAINS( db.Team01.Table_1.drug,r'%cortisol%','%cortisone%','%dexamethasone%')

I expect my output to be like as shown below

Upvotes: 18

Answers (6)

Luke Puplett

Reputation: 45105

I've not used BigQuery but have been reading the docs researching it. I came across this while looking into impact of choosing collation at design stage.

I'm either wrong or this is a new feature since answers above.

CONTAINS_SUBSTR

https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#contains_substr

Performs a normalized, case-insensitive search to see if a value exists as a substring in an expression. Returns TRUE if the value exists, otherwise returns FALSE.

Before values are compared, they are normalized and case folded with NFKC normalization. Wildcard searches are not supported.

Upvotes: 9

schustan235

Reputation: 46

Another potentially more intutive solution would be to use the BigQuery Contains_Substr to return boolean results.

Upvotes: 2

Gordon Linoff

Reputation: 1269563

In Postgres, I would recommend using the filter clause:

select subject_id,
       count(*) filter (where lower(drug) ~ 'cortisol|cortisone|dexamethasone') as steroids,
       count(*) filter (where lower(drug) ~ 'peptide|paracetamol') as aspirin,
from db.Team01.Table_1
group by subject_id;

In BigQuery, I would recommend countif():

select subject_id,
       countif(regexp_contains(drug, 'cortisol|cortisone|dexamethasone') as steroids,
       countif(drug ~ ' 'peptide|paracetamol') as aspirin,
from db.Team01.Table_1
group by subject_id;

You can use sum(case when . . . end) as a more general approach. However, each database has a more "local" way of expressing this logic. By the way, the FILTER clause is standard SQL, just not widely adopted.

Upvotes: 4

Mikhail Berlyant

Reputation: 172974

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  subject_id,
  SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'cortisol|cortisone|dexamethasone') THEN 1 ELSE 0 END) AS steroids,
  SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'peptide|paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM `db.Team01.Table_1`
GROUP BY subject_id

if to apply to sample data from your question - result is

Row subject_id  steroids    aspirin  
1   1           3           1    
2   2           1           1

Note: instead of simple LIKE ending with lengthy and redundant text - I am using LIKE on steroids - which is REGEXP_CONTAINS

Upvotes: 19

Ed Bangga

Reputation: 13006

you have missing group-by

select subject_id,
    sum(case when lower(drug) in ('cortisol','cortisone','dexamethasone')
       then 1 else 0 end) as steroids,
    sum(case when lower(drug) in ('peptide','paracetamol') 
       then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id

using like keyword

select subject_id,
 sum(case when lower(drug) like '%cortisol%'
        or lower(drug) like '%cortisone%'
        or lower(drug) like '%dexamethasone%'   
    then 1 else 0 end) as steroids,
    sum(case when lower(drug) like '%peptide%'
        or lower(drug) like '%paracetamol%'
    then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id

Upvotes: 1

GMB

Reputation: 222432

Use conditional aggregation. This is a solution that works across most (if not all) RDBMS:

SELECT
    subject_id,
    MAX(CASE WHEN drug IN ('cortisol', 'cortisone', 'dexamethasone') THEN 1 END) steroids,
    MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 END) aspirin
FROM db.Team01.Table_1.drug
GROUP BY subject_id

NB: it is unclear why you are using LIKE, since it seems like you are having exact matches; I turned the LIKE condition to equalities.

Upvotes: 1

Equivalent of string contains in google bigquery

Answers (6)

CONTAINS_SUBSTR

Related Questions