Reputation: 125
I have the following column in a dataframe:
column_1
en-us,en-en
pr,en-us,en-en,br
ar-ar,pr,en-en
I want to split that column (this can be done with .str.split), but using .str.split I will get:
column_1 | column_2 | column_3 | column_4
en-us en-en
pr en-us en-en br
ar-ar pr en-en
And what I need is:
column_1 | column_2 | column_3 | column_4
en-us en-en
en-us en-en br pr
ar-ar en-en pr
Is there any automatic way of doing this?
Upvotes: 1
Views: 51
Reputation: 59274
IIUC, you can do this by passing a list of dictionaries to the default pd.DataFrame
constructor. For example,
df = pd.DataFrame(s.str.split(',').transform(lambda x: {k:k for k in x}).tolist())
yields
   ar-ar   br  en-en  en-us   pr
0    NaN  NaN  en-en  en-us  NaN
1    NaN   br  en-en  en-us   pr
2  ar-ar  NaN  en-en    NaN   pr
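For a version that can be run as-is, here is a minimal sketch. It assumes the data lives in a Series named s (the name col1 is taken from the output shown in the explanation), and it uses a plain list comprehension instead of transform to build the dictionaries. Recent pandas versions keep dictionary keys in insertion order, so the columns are sorted explicitly to match the output above.

```python
import pandas as pd

# Assumed input: the question's column as a Series.
s = pd.Series(
    ["en-us,en-en", "pr,en-us,en-en,br", "ar-ar,pr,en-en"],
    name="col1",
)

# One dictionary per row, mapping each code to itself.
rows = [{code: code for code in codes} for codes in s.str.split(",")]

# The constructor aligns values by key; missing keys become NaN.
df = pd.DataFrame(rows)

# Sort the columns so the layout does not depend on insertion order.
df = df[sorted(df.columns)]
print(df)
```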
Notice that it is trivial to reorder the data frame according to your needs, e.g.
>>> df[['en-en', 'en-us', 'br', 'pr']]
   en-en  en-us   br   pr
0  en-en  en-us  NaN  NaN
1  en-en  en-us   br   pr
2  en-en    NaN  NaN   pr
And if you want to have empty strings rather than NaNs, just use .fillna():
df[['en-en', 'en-us', 'br', 'pr']].fillna('')
   en-en  en-us  br  pr
0  en-en  en-us
1  en-en  en-us  br  pr
2  en-en             pr
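If you also want the generic column_1, column_2, ... headers from the question, one possible sketch is below. The order in wanted is a hypothetical choice, not something the question specifies. It uses reindex rather than df[[...]] because reindex creates an all-NaN column for a label that is missing, whereas list indexing raises a KeyError.

```python
import pandas as pd

s = pd.Series(["en-us,en-en", "pr,en-us,en-en,br", "ar-ar,pr,en-en"])
df = pd.DataFrame([{c: c for c in codes} for codes in s.str.split(",")])

# Hypothetical column order; adjust to whatever layout you need.
wanted = ["en-en", "en-us", "br", "pr", "ar-ar"]

# Reorder, then replace NaN with empty strings.
out = df.reindex(columns=wanted).fillna("")

# Rename to positional headers as in the question.
out.columns = [f"column_{i}" for i in range(1, len(wanted) + 1)]
print(out)
```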
Explanation
Let's break down the following statement
s.str.split(',').transform(lambda x: {k:k for k in x}).tolist()
First of all, s.str.split(',') does what you already know: it splits using , as the separator. This yields the following series:
0 [en-us, en-en]
1 [pr, en-us, en-en, br]
2 [ar-ar, pr, en-en]
Name: col1, dtype: object
Now, we want to change each of these elements into a {key: value} structure. For that, we use transform, passing a function to it:
s.str.split(',').transform(function)
where function = lambda x: {k:k for k in x}. So basically we will run this function for the input [en-us, en-en], then for [pr, en-us, en-en, br], etc. The output of this function is
0 {'en-en': 'en-en', 'en-us': 'en-us'}
1 {'br': 'br', 'en-en': 'en-en', 'en-us': 'en-us...
2 {'en-en': 'en-en', 'ar-ar': 'ar-ar', 'pr': 'pr'}
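To inspect these two intermediate steps yourself, something like the following should work. It is only a sketch: the variable names split and as_dicts are made up for illustration, and it uses Series.apply in place of transform, which applies the function element by element in the same way here.

```python
import pandas as pd

s = pd.Series(
    ["en-us,en-en", "pr,en-us,en-en,br", "ar-ar,pr,en-en"],
    name="col1",
)

# Step 1: each string becomes a list of codes.
split = s.str.split(",")

# Step 2: each list becomes a dict mapping every code to itself.
as_dicts = split.apply(lambda codes: {k: k for k in codes})

print(split)
print(as_dicts)
```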
Now, we just use tolist() to get a list of these values, and input that in the pd.DataFrame() constructor. The constructor knows how to deal with lists of dictionaries pretty well, and it assigns values based on the keys of the dictionaries for each row. Whenever no key/value is found for a row, it just uses NaNs.
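The key-based alignment is easier to see on a toy input. The records below are invented purely for illustration:

```python
import pandas as pd

# Two rows with partly different keys.
records = [
    {"a": "a", "b": "b"},
    {"b": "b", "c": "c"},
]

toy = pd.DataFrame(records)

# Columns are the union of all keys; absent keys show up as NaN.
print(toy)
```

Each dictionary becomes one row, and a value lands in the column named by its key regardless of its position in the dictionary.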
Upvotes: 2