Svkm
Svkm

Reputation: 21

How extract subsets from string in bigquery

I have a string variable consisting of 4 parts looking either something like this:

social_CA_FR_bloom_build

or this:

 social_DE_bloom_trgj4_67

(and different variations of this).

How would I write a regex in BigQuery to extract the subsets so I can build new columns containing the first part (social), second part (CA_FR or DE etc), third part (bloom) and fourth part (trgj4_67 or build, etc)?

I've already tried this:

regexp_extract(field, "([^_]+)_([A-Z]{2}[_A-Z]*)_([^_]+)_(.+)", 1)

which works in SQL (hive), but not in BQ as it doesn't recognize (string, string, integer).

I was thinking of replacing the underscore between CA_FR (which should be kept together and only occurs in a couple of rows, normally it's just two capitals) with a hyphen and then using SPLIT on the string instead of a regex_extract, but I'm not sure how to go about this/if this is the solution.

Any help would be greatly appreciated!

Upvotes: 1

Views: 1621

Answers (1)

Elliott Brossard
Elliott Brossard

Reputation: 33705

One idea is to apply a regular expression and then use SPLIT. For example:

#standardSQL
SELECT
  parts[SAFE_OFFSET(0)] AS social,
  parts[SAFE_OFFSET(1)] AS language,
  parts[SAFE_OFFSET(2)] AS bloom,
  parts[SAFE_OFFSET(3)] AS build
FROM (
  SELECT
    SPLIT(REGEXP_REPLACE(
            label,
            r'([^_]+)_([A-Z]{2}[_A-Z]*)_([^_]+)_(.+)',
            r'\1|\2|\3|\4'),
          '|') AS parts
  FROM YourTable
);

I made up the column names--you can use whichever names you want. As a self-contained example:

#standardSQL
WITH YourTable AS (
  SELECT 'social_CA_FR_bloom_build' AS label UNION ALL
  SELECT 'social_DE_bloom_trgj4_67' AS label
)
SELECT
  parts[SAFE_OFFSET(0)] AS social,
  parts[SAFE_OFFSET(1)] AS language,
  parts[SAFE_OFFSET(2)] AS bloom,
  parts[SAFE_OFFSET(3)] AS build
FROM (
  SELECT
    SPLIT(REGEXP_REPLACE(
            label,
            r'([^_]+)_([A-Z]{2}[_A-Z]*)_([^_]+)_(.+)',
            r'\1|\2|\3|\4'),
          '|') AS parts
  FROM YourTable
);

Upvotes: 1

Related Questions