Split column in pandas of comma separated values but maintaining the order

Question

I have the following column in a dataframe:

column_1
en-us,en-en
pr,en-us,en-en,br
ar-ar,pr,en-en

I want to split that column (this can be done with .str.split), but using .str.split I will get:

column_1 | column_2 | column_3 | column_4
en-us      en-en
pr         en-us      en-en      br
ar-ar      pr         en-en

And what I need is:

column_1 | column_2 | column_3 | column_4
en-us      en-en      
en-us      en-en      br         pr
ar-ar      en-en                 pr

Is there any automatic way of doing this?

rafaelc · Accepted Answer

IIUC, you can do this by passing a list of dictionaries to the default pd.DataFrame constructor. For example,

df = pd.DataFrame(s.str.split(',').transform(lambda x: {k:k for k in x}).tolist())

yields

    ar-ar   br  en-en   en-us   pr
0   NaN     NaN en-en   en-us   NaN
1   NaN     br  en-en   en-us   pr
2   ar-ar   NaN en-en   NaN     pr
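
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the column from the question is held in a Series named s (the name used above), and it uses a plain list comprehension instead of transform, which should produce the same list of dictionaries. The explicit sort is my addition: the column order coming out of the constructor can differ between pandas versions, so sorting gives a stable layout like the one shown above.

```python
import pandas as pd

# Sample data copied from the question; `s` is the name used in the answer.
s = pd.Series(["en-us,en-en", "pr,en-us,en-en,br", "ar-ar,pr,en-en"], name="column_1")

# Split each row on commas, then map every value to itself: {value: value}.
records = [{k: k for k in parts} for parts in s.str.split(",")]

# One column per distinct key; rows lacking a key get NaN in that column.
df = pd.DataFrame(records)

# Column order may vary by pandas version, so sort the labels explicitly.
df = df[sorted(df.columns)]
print(df)
```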

Notice that it is trivial to reorder the data frame according to your needs, e.g.

>>> df[['en-en', 'en-us', 'br', 'pr']]
    en-en   en-us   br  pr
0   en-en   en-us   NaN NaN
1   en-en   en-us   br  pr
2   en-en   NaN     NaN pr

And if you want empty strings rather than NaNs, just use .fillna():

df[['en-en', 'en-us', 'br', 'pr']].fillna('')

    en-en   en-us   br  pr
0   en-en   en-us       
1   en-en   en-us   br  pr
2   en-en           pr
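
One caveat with selecting columns by a list: it raises a KeyError if a label never occurs in the data. A possible alternative, sketched below, is DataFrame.reindex(columns=...), which creates an empty column for any missing label instead. The list name wanted and the extra label "zh-cn" are hypothetical, chosen only to show that behaviour.

```python
import pandas as pd

# Same sample data as in the question.
s = pd.Series(["en-us,en-en", "pr,en-us,en-en,br", "ar-ar,pr,en-en"])
df = pd.DataFrame([{k: k for k in parts} for parts in s.str.split(",")])

# Hypothetical preferred order; "zh-cn" does not occur in the data at all.
wanted = ["en-en", "en-us", "br", "pr", "zh-cn"]

# reindex keeps only the listed columns, in that order, and adds an all-NaN
# column for labels that are missing; fillna("") then blanks out every NaN.
out = df.reindex(columns=wanted).fillna("")
print(out)
```

Note that "ar-ar" is dropped here simply because it is not in the list, exactly as with the column selection above.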

Explanation

Let's break down the following statement

s.str.split(',').transform(lambda x: {k:k for k in x}).tolist()

First of all, s.str.split(',') does what you already know: splits using , as separator. This yields the following series

0            [en-us, en-en]
1    [pr, en-us, en-en, br]
2        [ar-ar, pr, en-en]
Name: col1, dtype: object

Now, we want to change each of these elements into a {key:value} structure. For that, we use transform passing a function to it:

s.str.split(',').transform(function)

where function = lambda x: {k:k for k in x}. So basically we will run this function for the input [en-us, en-en], then for [pr, en-us, en-en, br], etc. The output of this function is

0                 {'en-en': 'en-en', 'en-us': 'en-us'}
1    {'br': 'br', 'en-en': 'en-en', 'en-us': 'en-us...
2     {'en-en': 'en-en', 'ar-ar': 'ar-ar', 'pr': 'pr'}

Now, we just use tolist() to get a list of these values, and pass that to the pd.DataFrame() constructor. The constructor knows how to deal with lists of dictionaries pretty well: it assigns values based on the keys of the dictionaries for each row. Whenever no key/value is found for a row, it just uses NaN.
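
To see the constructor behaviour in isolation, here is a small sketch with the dictionaries written out by hand (the variable names rows and missing are mine). Counting the NaNs per column shows which keys were absent from which rows.

```python
import pandas as pd

# The three per-row dictionaries from the step above, typed out by hand.
rows = [
    {"en-us": "en-us", "en-en": "en-en"},
    {"pr": "pr", "en-us": "en-us", "en-en": "en-en", "br": "br"},
    {"ar-ar": "ar-ar", "pr": "pr", "en-en": "en-en"},
]

# Keys become column labels; a row without a given key gets NaN there.
df = pd.DataFrame(rows)

# Number of rows in which each key was missing.
missing = df.isna().sum()
print(missing)
```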
