Glyn Bury
Glyn Bury

Reputation: 63

Is it possible to remove duplicates from a string in Big Query?

So been working with some data and currently have output along the lines of

Customer | Reasons
Customer1 | Answer1, Answer3, Answer2, Answer4, Answer5, Answer1, Answer3, Answer1

Is there anyway in Big Query standard sql to rid myself of duplicates within this string and end with the output below?

Customer | Reasons
Customer1 | Answer1, Answer3, Answer2, Answer4, Answer5

Thanks in advance

Upvotes: 1

Views: 4791

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

While voting up Elliott's answer - wanted to add another option (BigQuery Standard SQL):

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Customer1' customer, 'Answer1, Answer3, Answer2, Answer4, Answer5, Answer1, Answer3, Answer1' answers 
)
SELECT * REPLACE(
  ARRAY_TO_STRING(ARRAY(SELECT DISTINCT answer
    FROM UNNEST(SPLIT(answers, ', ')) AS answer 
  ), ', ') AS answers)   
FROM `project.dataset.table`    

which produces result you need

Row customer    answers  
1   Customer1   Answer1, Answer3, Answer2, Answer4, Answer5   

In case if for some reason you would wanted to have those values ordered - you just add one line as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Customer1' customer, 'Answer1, Answer3, Answer2, Answer4, Answer5, Answer1, Answer3, Answer1' answers 
)
SELECT * REPLACE(
  ARRAY_TO_STRING(ARRAY(SELECT DISTINCT answer
    FROM UNNEST(SPLIT(answers, ', ')) AS answer 
    ORDER BY answer
  ), ', ') AS answers)   
FROM `project.dataset.table`     

with result as

Row customer    answers  
1   Customer1   Answer1, Answer2, Answer3, Answer4, Answer5      

Note: most likely need in ordering is not relevant for particular use case in your question - it can be handy in other cases

Upvotes: 3

Elliott Brossard
Elliott Brossard

Reputation: 33745

Assuming I understood the question correctly, you want something like:

SELECT
  (SELECT STRING_AGG(DISTINCT s, ', ')
   FROM UNNEST(SPLIT(Customer1, ', ')) AS s) AS Customer1
FROM dataset.table

This splits the string on the ', ' separator, then aggregates the substrings into a new string with duplicates removed using the DISTINCT keyword.

Upvotes: 6

Related Questions