The Great
The Great

Reputation: 7693

Equivalent of string contains in google bigquery

I have a table like as shown below

enter image description here

I would like to create two new binary columns indicating whether the subject had steroids and aspirin. I am looking to implement this in Postgresql and google bigquery

I tried the below but it doesn't work

select subject_id
case when lower(drug) like ('%cortisol%','%cortisone%','%dexamethasone%') 
then 1 else 0 end as steroids,
case when lower(drug) like ('%peptide%','%paracetamol%') 
then 1 else 0 end as aspirin,
from db.Team01.Table_1


SELECT 
db.Team01.Table_1.drug
FROM `table_1`,
UNNEST(table_1.drug) drug
WHERE REGEXP_CONTAINS( db.Team01.Table_1.drug,r'%cortisol%','%cortisone%','%dexamethasone%')

I expect my output to be like as shown below

enter image description here

Upvotes: 18

Views: 93780

Answers (6)

Luke Puplett
Luke Puplett

Reputation: 45105

I've not used BigQuery but have been reading the docs researching it. I came across this while looking into impact of choosing collation at design stage.

I'm either wrong or this is a new feature since answers above.

CONTAINS_SUBSTR

https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#contains_substr

Performs a normalized, case-insensitive search to see if a value exists as a substring in an expression. Returns TRUE if the value exists, otherwise returns FALSE.

Before values are compared, they are normalized and case folded with NFKC normalization. Wildcard searches are not supported.

Upvotes: 9

schustan235
schustan235

Reputation: 46

Another potentially more intutive solution would be to use the BigQuery Contains_Substr to return boolean results.

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269563

In Postgres, I would recommend using the filter clause:

select subject_id,
       count(*) filter (where lower(drug) ~ 'cortisol|cortisone|dexamethasone') as steroids,
       count(*) filter (where lower(drug) ~ 'peptide|paracetamol') as aspirin,
from db.Team01.Table_1
group by subject_id;

In BigQuery, I would recommend countif():

select subject_id,
       countif(regexp_contains(drug, 'cortisol|cortisone|dexamethasone') as steroids,
       countif(drug ~ ' 'peptide|paracetamol') as aspirin,
from db.Team01.Table_1
group by subject_id;

You can use sum(case when . . . end) as a more general approach. However, each database has a more "local" way of expressing this logic. By the way, the FILTER clause is standard SQL, just not widely adopted.

Upvotes: 4

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  subject_id,
  SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'cortisol|cortisone|dexamethasone') THEN 1 ELSE 0 END) AS steroids,
  SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'peptide|paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM `db.Team01.Table_1`
GROUP BY subject_id   

if to apply to sample data from your question - result is

Row subject_id  steroids    aspirin  
1   1           3           1    
2   2           1           1     

Note: instead of simple LIKE ending with lengthy and redundant text - I am using LIKE on steroids - which is REGEXP_CONTAINS

Upvotes: 19

Ed Bangga
Ed Bangga

Reputation: 13006

you have missing group-by

select subject_id,
    sum(case when lower(drug) in ('cortisol','cortisone','dexamethasone')
       then 1 else 0 end) as steroids,
    sum(case when lower(drug) in ('peptide','paracetamol') 
       then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id

using like keyword

select subject_id,
 sum(case when lower(drug) like '%cortisol%'
        or lower(drug) like '%cortisone%'
        or lower(drug) like '%dexamethasone%'   
    then 1 else 0 end) as steroids,
    sum(case when lower(drug) like '%peptide%'
        or lower(drug) like '%paracetamol%'
    then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id

Upvotes: 1

GMB
GMB

Reputation: 222432

Use conditional aggregation. This is a solution that works across most (if not all) RDBMS:

SELECT
    subject_id,
    MAX(CASE WHEN drug IN ('cortisol', 'cortisone', 'dexamethasone') THEN 1 END) steroids,
    MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 END) aspirin
FROM db.Team01.Table_1.drug
GROUP BY subject_id

NB: it is unclear why you are using LIKE, since it seems like you are having exact matches; I turned the LIKE condition to equalities.

Upvotes: 1

Related Questions