Reputation: 113
I'm relatively new to Python and would like to seek your help in the following problem. My current series of information looks like this:
df['word_feats']
------------------------------------------------
0 Case=Loc|Gender=Neut|Number=Sing|Person=3
1 Case=Nom|Gender=Neut|Number=Sing|Person=3
2 PunctType=Comm
3 Case=Nom|Gender=Neut|Number=Sing|Person=3
4 Case=Nom|Gender=Neut|Number=Sing|Person=3
5 None
6 Case=Nom|Gender=Neut|Number=Sing|Person=3
7 Case=Loc|Gender=Neut|Number=Sing|Person=3
8 Gender=Neut|Number=Sing|Person=3
9 Case=Loc|Gender=Neut|Number=Plur|Person=3
10 None
My results in a single column are not in a dictionary format (which would be easier). Instead, it is in a pipe separated format with equals sign.
I am hoping to split this series into multiple columns, taking the 'key' (in this case, LEFT of equal sign) as the column name, and 'value' (RIGHT of equal sign) as my cell.
I've tried something along this line:
df['word_feats'].str.split('|', expand=True)
Which doesn't work as:
Would appreciate any answers for this! Thanks.
Upvotes: 2
Views: 547
Reputation: 113
I have overcome this issue by doing the following:
Step 1: Convert series to dictionary:
word_feats_dict=[]
for i in df['word_feats']:
if i== None:
q='None'
else:
q = dict(item.split('=') for item in i.split('|'))
word_feats_dict.append(q)
df['word_feats_dict']=word_feats_dict # creating another series that is in dictionary form.
Step 2: Explode dictionary into Multiple columns
df['word_feats_dict'].apply(pd.Series)
df = pd.concat([df.drop(['word_feats_dict'],axis=1),df['word_feats_dict'].apply(pd.Series)],axis=1)
df.drop(columns=0,inplace=True) # As I have 'None' as some of my values, they turned out to be a column on itself called '0'.
Upvotes: 0
Reputation: 71687
We can use a regular expression pattern to find all the occurrences of key-value pairs from each row, this will generate a list of tuples in every row, then map each list to dict
and construct a dataframe from the mapped records
pd.DataFrame(map(dict, df['word_feats'].str.findall(r'([^|=]+)=([^|]+)')))
Case Gender Number Person PunctType
0 Loc Neut Sing 3 NaN
1 Nom Neut Sing 3 NaN
2 NaN NaN NaN NaN Comm
3 Nom Neut Sing 3 NaN
4 Nom Neut Sing 3 NaN
5 NaN NaN NaN NaN NaN
6 Nom Neut Sing 3 NaN
7 Loc Neut Sing 3 NaN
8 NaN Neut Sing 3 NaN
9 Loc Neut Plur 3 NaN
10 NaN NaN NaN NaN NaN
Upvotes: 3