Reputation: 10173
Here is a short example of a messy CSV file that is loaded into Bigquery each day:
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
select 'Basketball' as sport, 'Third League' as league
The issue is that when a player is involved in multiple sports, the sports and leagues are concatenated into the same column. Assuming we know that Basketball
would always be the first sport, we can check the sport
column for extra sports in a few ways:
right(10) = 'basketball'
length(sport) = 10
What is tougher is to clean up the league
column. The lack of space between the duplicate league names is tricky to deal with. Our desired output is:
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'Basketball' as sport, 'Third League' as league
Upvotes: 1
Views: 187
Reputation: 3034
Try the following:
with sample_data as (
select 'Basketball' as sport, 'Main League' as league union all
select 'Basketball' as sport, 'Main League' as league union all
select 'BasketballBaseball' as sport, 'Second LeagueSecond League' as league union all
select 'Basketball' as sport, 'Second League' as league union all
select 'Basketball' as sport, 'Third League' as league union all
select 'BasketballFootball' as sport, 'Third LeagueThird League' as league union all
select 'BasketballFootballSoccer' as sport, 'Third LeagueThird LeagueFourth League' as league union all
select 'Basketball' as sport, 'Third League' as league
)
select
*
from sample_data
, UNNEST(regexp_extract_all(sport, r'([A-Z][a-z]+)')) as split_sport with offset as ss_offset
left join UNNEST(regexp_extract_all(league, r'([A-Z][a-z]+ League)')) as split_league with offset as sl_offset
on ss_offset=sl_offset
Using regex functions allows you to get the elements you want into an array. Then you can join on the offsets to match the sport with the league.
This produces the following with the sample data provided, as a note i added a 3 sport scenario.
At this point you can add additional filter criteria to include just the 0 offset, a specific sport, or league.
Upvotes: 2