Canovice
Canovice

Reputation: 10173

In BigQuery, remove duplicate strings in string column

Here is a short example of a messy CSV file that is loaded into Bigquery each day:

select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
select 'Basketball' as sport, 'Third League' as league

The issue is that when a player is involved in multiple sports, the sports and leagues are concatenated into the same column. Assuming we know that Basketball would always be the first sport, we can check the sport column for extra sports in a few ways:

What is tougher is to clean up the league column. The lack of space between the duplicate league names is tricky to deal with. Our desired output is:

select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league

Upvotes: 1

Views: 187

Answers (1)

Daniel Zagales
Daniel Zagales

Reputation: 3034

Try the following:

with sample_data as (
    select 'Basketball' as sport, 'Main League' as league union all
    select 'Basketball' as sport, 'Main League' as league union all
    select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
    select 'Basketball' as sport, 'Second League' as league union all
    select 'Basketball' as sport, 'Third League' as league union all
    select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
    select 'BasketballFootballSoccer' as sport, 'Third LeagueThird LeagueFourth League' as league union all
    select 'Basketball' as sport, 'Third League' as league
)

select 
    *
from sample_data
, UNNEST(regexp_extract_all(sport, r'([A-Z][a-z]+)')) as split_sport with offset as ss_offset
left join UNNEST(regexp_extract_all(league, r'([A-Z][a-z]+ League)')) as split_league with offset as sl_offset
 on ss_offset=sl_offset

Using regex functions allows you to get the elements you want into an array. Then you can join on the offsets to match the sport with the league.

This produces the following with the sample data provided, as a note i added a 3 sport scenario.

enter image description here

At this point you can add additional filter criteria to include just the 0 offset, a specific sport, or league.

Upvotes: 2

Related Questions