Reputation: 7693
I have a table like as shown below
I would like to create two new binary columns
indicating whether the subject had steroids
and aspirin
. I am looking to implement this in Postgresql and google bigquery
I tried the below but it doesn't work
select subject_id
case when lower(drug) like ('%cortisol%','%cortisone%','%dexamethasone%')
then 1 else 0 end as steroids,
case when lower(drug) like ('%peptide%','%paracetamol%')
then 1 else 0 end as aspirin,
from db.Team01.Table_1
SELECT
db.Team01.Table_1.drug
FROM `table_1`,
UNNEST(table_1.drug) drug
WHERE REGEXP_CONTAINS( db.Team01.Table_1.drug,r'%cortisol%','%cortisone%','%dexamethasone%')
I expect my output to be like as shown below
Upvotes: 18
Views: 93780
Reputation: 45105
I've not used BigQuery but have been reading the docs researching it. I came across this while looking into impact of choosing collation at design stage.
I'm either wrong or this is a new feature since answers above.
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#contains_substr
Performs a normalized, case-insensitive search to see if a value exists as a substring in an expression. Returns TRUE if the value exists, otherwise returns FALSE.
Before values are compared, they are normalized and case folded with NFKC normalization. Wildcard searches are not supported.
Upvotes: 9
Reputation: 46
Another potentially more intutive solution would be to use the BigQuery Contains_Substr to return boolean results.
Upvotes: 2
Reputation: 1269563
In Postgres, I would recommend using the filter
clause:
select subject_id,
count(*) filter (where lower(drug) ~ 'cortisol|cortisone|dexamethasone') as steroids,
count(*) filter (where lower(drug) ~ 'peptide|paracetamol') as aspirin,
from db.Team01.Table_1
group by subject_id;
In BigQuery, I would recommend countif()
:
select subject_id,
countif(regexp_contains(drug, 'cortisol|cortisone|dexamethasone') as steroids,
countif(drug ~ ' 'peptide|paracetamol') as aspirin,
from db.Team01.Table_1
group by subject_id;
You can use sum(case when . . . end)
as a more general approach. However, each database has a more "local" way of expressing this logic. By the way, the FILTER
clause is standard SQL, just not widely adopted.
Upvotes: 4
Reputation: 172974
Below is for BigQuery Standard SQL
#standardSQL
SELECT
subject_id,
SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'cortisol|cortisone|dexamethasone') THEN 1 ELSE 0 END) AS steroids,
SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'peptide|paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM `db.Team01.Table_1`
GROUP BY subject_id
if to apply to sample data from your question - result is
Row subject_id steroids aspirin
1 1 3 1
2 2 1 1
Note: instead of simple LIKE ending with lengthy and redundant text - I am using LIKE on steroids
- which is REGEXP_CONTAINS
Upvotes: 19
Reputation: 13006
you have missing group-by
select subject_id,
sum(case when lower(drug) in ('cortisol','cortisone','dexamethasone')
then 1 else 0 end) as steroids,
sum(case when lower(drug) in ('peptide','paracetamol')
then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id
using like
keyword
select subject_id,
sum(case when lower(drug) like '%cortisol%'
or lower(drug) like '%cortisone%'
or lower(drug) like '%dexamethasone%'
then 1 else 0 end) as steroids,
sum(case when lower(drug) like '%peptide%'
or lower(drug) like '%paracetamol%'
then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id
Upvotes: 1
Reputation: 222432
Use conditional aggregation. This is a solution that works across most (if not all) RDBMS:
SELECT
subject_id,
MAX(CASE WHEN drug IN ('cortisol', 'cortisone', 'dexamethasone') THEN 1 END) steroids,
MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 END) aspirin
FROM db.Team01.Table_1.drug
GROUP BY subject_id
NB: it is unclear why you are using LIKE
, since it seems like you are having exact matches; I turned the LIKE
condition to equalities.
Upvotes: 1