Anya
Anya

Reputation: 123

Google BigQuery Replace function for string type

I am trying to replace certain customer names in my data. I was able to do SQL using Google BigQuery language to transform one part of the string another via the replace function for one particular string.

Replace(CustomerName, 'ABC', 'XYZ')

However, I have a couple more that I would need to use the replace function such that

Replace(CustomerName, 'PLO', 'Rustic')
Replace(CustomerName, 'Kix', 'BowWow')

and so on.

I've tried doing

Replace(CustomerName, 'ABC', 'XYZ') OR Replace(CustomerName, 'PLO', 'Rustic') OR Replace(CustomerName, 'Kix', 'BowWow')

but that got me an error message.

I've also tried

Replace(CustomerName, 'ABC', 'XYZ') AND Replace(CustomerName, 'PLO', 'Rustic') AND Replace(CustomerName, 'Kix', 'BowWow')

but that also got me an error message.

I am able to just use "case when statement" and then hardcode each one, but I'm wondering if there is a better/faster way to just use replace statement instead.

Thanks for your help.

Upvotes: 7

Views: 59637

Answers (3)

Eduardo Mota
Eduardo Mota

Reputation: 1

I believe there are multiple ways to tackle this problem, and it depends on the size of your dataset, practicality of simply making a guiding table by hand and uploading it to BigQuery, and the granularity of the data you want to replace.

If your values are very granular, you can create a table with "from" and "to" values on different columns, and join that table with your main table, and retrieve those values very cleanly.

# Replace the support_table table with your actual table
WITH support_table AS (
    SELECT "ABC" AS OldValue, "XYZ" AS NewValue
)
SELECT main_table.OldValue, support_table.NewValue FROM main_table
JOIN support_table ON main_table.old_value = support_table.old_value

Now, if you want to replace a big list of different values with something, you can use REGEXP_REPLACE with a string containing all possible values.

If you have a very big list of items, you can use STRING_AGG in a table with all the values you want to replace, or skip the STRING_AGG step and create said string by hand. Both of the snippets below result in "item1|item2|item3". Choose which is faster for you to do.

# Replace the values_to_replace table with your actual table
WITH values_to_replace AS (
   SELECT "item1" AS ColumnWithItemsToReplace
   UNION ALL
   SELECT "item2"
   UNION ALL
   SELECT "item3"
)
SELECT STRING_AGG(ColumnsWithItemsToReplace,"|") FROM values_to_replace
SELECT r"item1|item2|item3"

STRING_AGG will retrieve all the values from a table or query and concatenate them using a separator of choice. If you use the pipe separator, you will be able to create a string like "item1|item2|item3|..."

For a regular expression, the pipe counts as "or", which means that the regex will interpret the string as "item1 or item2 or item3". Thus, if you pass that generated string to REGEXP_REPLACE as the values to be replaced, it will be considered valid.

Example code below:

REGEXP_REPLACE(
 column_to_replace
,(SELECT STRING_AGG(ColumnWithItemsToReplace,"|") FROM `YourTable`)
,"Replacer"
)

Hope it helps.

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

In cases when you have quite a number of replacements - chaining of REPLACEs can become not practical and annoying manual work.
Below addresses this potential issue (assuming you maintain Lookup table with pairs: Word, Replacement)

SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
  SELECT
    CustomerName, Replacements
  FROM YourTable
  CROSS JOIN (
    SELECT 
      GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
    FROM ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
  var Replacements = r.Replacements.split(';');
  var fixedCustomerName = r.CustomerName;
  for (var i = 0; i < Replacements.length; i++) {
    var pat = new RegExp(Replacements[i].split(',')[0],'gi')
    fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
  }
  emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
 }"
)

You can test it using below example

SELECT CustomerName, fixedCustomerName FROM JS(
// input table
(
  SELECT
    CustomerName, Replacements
  FROM (
    SELECT CustomerName FROM
      (SELECT '1234ABC567' AS CustomerName),
      (SELECT '12 34 PLO 56' AS CustomerName),
      (SELECT 'Kix' AS CustomerName),
      (SELECT '98 ABC PLO Kix ABC 76 XYZ 54' AS CustomerName),
      (SELECT 'ABCQweKIX' AS CustomerName)
  ) YourTable
  CROSS JOIN (
    SELECT 
      GROUP_CONCAT_UNQUOTED(CONCAT(Word, ',', Replacement), ';') AS Replacements
    FROM (
      SELECT Word, Replacement FROM
        (SELECT 'XYZ' AS Word, 'QWE' AS Replacement),
        (SELECT 'ABC' AS Word, 'XYZ' AS Replacement),
        (SELECT 'PLO' AS Word, 'Rustic' AS Replacement),
        (SELECT 'Kix' AS Word, 'BowWow' AS Replacement)
    )
  ) ReplacementLookup
) ,
// input columns
CustomerName, Replacements,
// output schema
"[
{name: 'CustomerName', type: 'string'},
{name: 'fixedCustomerName', type: 'string'}
]",
// function
"function(r, emit){
  var Replacements = r.Replacements.split(';');
  var fixedCustomerName = r.CustomerName;
  for (var i = 0; i < Replacements.length; i++) {
    var pat = new RegExp(Replacements[i].split(',')[0],'gi')
    fixedCustomerName = fixedCustomerName.replace(pat, Replacements[i].split(',')[1]);
  }
  emit({CustomerName: r.CustomerName,fixedCustomerName: fixedCustomerName});
 }"
)

Please note: there is still issue if result of one replacement matches the input to a subsequent replacement

Upvotes: 0

Jeremy Condit
Jeremy Condit

Reputation: 7046

The CASE WHEN option is pretty reasonable. Another option is to chain them together:

REPLACE(
  REPLACE(
    REPLACE(
      CustomerName,
      'ABC',
      'XYZ'),
    'PLO',
    'Rustic'),
  'Kix',
  'BowWow')

Which one you pick really depends on the exact scenario. The chained REPLACE calls are probably faster, but they could overlap in weird ways (e.g., if the output to one replacement matches the input to a subsequent one). The CASE WHEN approach avoids that issue, but it's probably more expensive because you need to do one operation to find the substring and another to actually replace it.

Note that when you're using AND or OR, you're trying to combine the string output of REPLACE as if it were a boolean, which is why it's failing.

Upvotes: 7

Related Questions