Reputation: 21
I have a string variable consisting of 4 parts looking either something like this:
social_CA_FR_bloom_build
or this:
social_DE_bloom_trgj4_67
(and different variations of this).
How would I write a regex in BigQuery to extract the subsets so I can build new columns containing the first part (social), second part (CA_FR or DE etc
), third part (bloom) and fourth part (trgj4_67 or build, etc
)?
I've already tried this:
regexp_extract(field, "([^_]+)_([A-Z]{2}[_A-Z]*)_([^_]+)_(.+)", 1)
which works in SQL (hive), but not in BQ as it doesn't recognize (string, string, integer).
I was thinking of replacing the underscore between CA_FR (which should be kept together and only occurs in a couple of rows, normally it's just two capitals) with a hyphen and then using SPLIT on the string instead of a regex_extract, but I'm not sure how to go about this/if this is the solution.
Any help would be greatly appreciated!
Upvotes: 1
Views: 1621
Reputation: 33705
One idea is to apply a regular expression and then use SPLIT
. For example:
#standardSQL
SELECT
parts[SAFE_OFFSET(0)] AS social,
parts[SAFE_OFFSET(1)] AS language,
parts[SAFE_OFFSET(2)] AS bloom,
parts[SAFE_OFFSET(3)] AS build
FROM (
SELECT
SPLIT(REGEXP_REPLACE(
label,
r'([^_]+)_([A-Z]{2}[_A-Z]*)_([^_]+)_(.+)',
r'\1|\2|\3|\4'),
'|') AS parts
FROM YourTable
);
I made up the column names--you can use whichever names you want. As a self-contained example:
#standardSQL
WITH YourTable AS (
SELECT 'social_CA_FR_bloom_build' AS label UNION ALL
SELECT 'social_DE_bloom_trgj4_67' AS label
)
SELECT
parts[SAFE_OFFSET(0)] AS social,
parts[SAFE_OFFSET(1)] AS language,
parts[SAFE_OFFSET(2)] AS bloom,
parts[SAFE_OFFSET(3)] AS build
FROM (
SELECT
SPLIT(REGEXP_REPLACE(
label,
r'([^_]+)_([A-Z]{2}[_A-Z]*)_([^_]+)_(.+)',
r'\1|\2|\3|\4'),
'|') AS parts
FROM YourTable
);
Upvotes: 1