N_Spen
N_Spen

Reputation: 61

Split Pandas Column by values that are in a list

I have three lists that look like this:

    age = ['51+', '21-30', '41-50', '31-40', '<21']
    cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
    device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']

I also have column in a df that looks like this:

    campaign_name   
0   notarget_<21_nokia_lumia_v3 
1   htc_one_1gb_21-30_notarget  
2   41-50_htc_one_2gb_cluster3      
3   <21_htc_one_2gb_limited_time_offer_notarget 
4   51+_cluster3_iphone_6/6+_all_other_devices

I want to split the column into three separate columns based on the values in the above lists. Like so:

    age     cluster     device  
0   <21     notarget    nokia_lumia_v3  
1   21-30   notarget    htc_one_1gb     
2   41-50   cluster3    htc_one_2gb  
3   <21     notarget    htc_one_2gb_limited_time_offer
4   51+     cluster3    iphone_6/6+_all_other_devices

First thought was to do a simple test like this:

ages_list = []

for i in ages:
    if i in df['campaign_name'][0]:
        ages_list.append(i)

print ages_list
>>>   ['<21']

I was then going to convert ages_list to a series and combine it with the remaining two to get the end result above but i assume there is a more pythonic way of doing it?

Upvotes: 1

Views: 65

Answers (1)

Nader Hisham
Nader Hisham

Reputation: 5414

the idea behind this is that you'll create a regular expression based on the values you already have , for example if you want to build a regular expressions that capture any value from your age list you may do something like this '|'.join(age) and so on for all the values you already have cluster & device.

a special case for device list becuase it contains + sign that will conflict with the regex ( because + means one or more when it comes to regex ) so we can fix this issue by replacing any value of + with \+ , so this mean I want to capture literally +

df = pd.DataFrame({'campaign_name' : ['notarget_<21_nokia_lumia_v3' , 'htc_one_1gb_21-30_notarget' , '41-50_htc_one_2gb_cluster3' , '<21_htc_one_2gb_limited_time_offer_notarget' , '51+_cluster3_iphone_6/6+_all_other_devices'] }) 


def split_df(df):
        campaign_name = df['campaign_name']
        df['age'] = re.findall('|'.join(age) , campaign_name)[0]
        df['cluster'] = re.findall('|'.join(cluster) , campaign_name)[0]
        df['device'] = re.findall('|'.join([x.replace('+' , '\+') for x in device ]) , campaign_name)[0]

    return df


df.apply(split_df, axis = 1 )

if you want to drop the original column you can do this

df.apply(split_df, axis = 1 ).drop( 'campaign_name', axis = 1)

Here I'm assuming that a value must be matched by regex but if this is not the case you can do your checks , you got the idea

Upvotes: 2

Related Questions