Reputation: 689
I have a dataframe with a lot of data, including one column that is structured like this:
index var_1
1 a=3:b=4:c=5:d=6:e=3
2 b=3:a=4:c=5:d=6:e=3
3 e=3:a=4:c=5:d=6
4 c=3:a=4:b=5:d=6:f=3
I am trying to structure the data in that column to look like this:
index a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
I have done the following thus far:
df1 = df['var_1'].str.split(':', expand=True)
I can then loop through the cols of df1 and do another split on '=', but then I'll just have loads of disorganised label cols and value cols.
Upvotes: 2
Views: 238
Reputation: 2407
You can apply "extractall" and "pivot".
After "extractall" you get:
             0  1
index match
1     0      a  3
      1      b  4
      2      c  5
      3      d  6
      4      e  3
2     0      b  3
      1      a  4
      2      c  5
      3      d  6
      4      e  3
3     0      e  3
      1      a  4
      2      c  5
      3      d  6
4     0      c  3
      1      a  4
      2      b  5
      3      d  6
      4      f  3
And in one step:
rslt= df.var_1.str.extractall(r"([a-z])=(\d+)") \
.reset_index(level="match",drop=True) \
.pivot(columns=0).fillna(0)
1
0 a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
#rslt.columns= rslt.columns.levels[1].values
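For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same extractall-and-pivot idea. It assumes the sample frame from the question, rebuilt by hand. The group names `key` and `value` and the variable `pairs` are illustrative choices of mine, not part of the answer above. Values are cast to integers before pivoting so the result is numeric rather than strings.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed shape: one string column).
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One row per key=value pair; named groups become the column names.
pairs = df["var_1"].str.extractall(r"(?P<key>[a-z])=(?P<value>\d+)")
pairs = pairs.reset_index(level="match", drop=True)
pairs["value"] = pairs["value"].astype(int)

# Spread keys into columns; keys missing from a row become 0.
rslt = (pairs.pivot(columns="key", values="value")
             .fillna(0)
             .astype(int)
             .rename_axis(columns=None))
print(rslt)
```

Passing `values="value"` keeps the columns single-level, so the commented-out flattening line is not needed in this variant.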
Upvotes: 0
Reputation: 88305
Here's one approach using str.get_dummies:
out = df.var_1.str.get_dummies(sep=':')
out = out * out.columns.str[2:].astype(int).values
out.columns = pd.MultiIndex.from_arrays([out.columns.str[0], out.columns])
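# max(axis=1, level=0) was removed in pandas 2.0; grouping the transposed frame is equivalent
print(out.T.groupby(level=0).max().T)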
a b c d e f
index
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
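The slicing with `str[0]` and `str[2:]` relies on every key being a single character. A hedged variant that splits each dummy column name on `=` instead might look like the sketch below. It assumes the sample frame from the question, and the names `dummies`, `keys`, `vals` and `wide` are mine.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One 0/1 indicator column per distinct "key=value" token.
dummies = df["var_1"].str.get_dummies(sep=":")

# Parse each column name into its key and its integer value.
keys = [col.split("=")[0] for col in dummies.columns]
vals = [int(col.split("=")[1]) for col in dummies.columns]

# Replace each indicator with the value it stands for (0 where absent).
weighted = dummies.mul(vals, axis="columns")

# Collapse columns that share a key, keeping the non-zero value.
wide = weighted.T.groupby(pd.Index(keys)).max().T
print(wide)
```

Like the original, this takes a maximum per key, so it only makes sense when the values are non-negative.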
Upvotes: 1
Reputation: 863541
Use a list comprehension to build a dictionary for each value, then pass the list to the DataFrame constructor:
comp = [dict([y.split('=') for y in x.split(':')]) for x in df['var_1']]
df = pd.DataFrame(comp).fillna(0).astype(int)
print (df)
a b c d e f
0 3 4 5 6 3 0
1 4 3 5 6 3 0
2 4 0 5 6 3 0
3 4 5 3 6 0 3
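Note that the output above is indexed 0 to 3 rather than 1 to 4, because the new frame gets a default index. If the original labels matter, a small sketch along these lines keeps them. It assumes the sample frame from the question; `records` and `wide` are illustrative names, and converting to `int` inside the comprehension is my choice so that the columns are numeric from the start.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# One {key: int_value} dictionary per row of var_1.
records = [
    {key: int(value)
     for key, value in (pair.split("=") for pair in row.split(":"))}
    for row in df["var_1"]
]

# Reuse the original index labels; missing keys become 0.
wide = pd.DataFrame(records, index=df.index).fillna(0).astype(int)
print(wide)
```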
Or use Series.str.split with expand=True to get a DataFrame, reshape with DataFrame.stack, split again on '=', drop the second level of the MultiIndex, append column 0 as a new index level, and finally reshape with Series.unstack:
df = (df['var_1'].str.split(':', expand=True)
.stack()
.str.split('=', expand=True)
.reset_index(level=1, drop=True)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
.rename_axis(None, axis=1))
print (df)
a b c d e f
1 3 4 5 6 3 0
2 4 3 5 6 3 0
3 4 0 5 6 3 0
4 4 5 3 6 0 3
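The chain above leaves the parsed values as strings, with an integer 0 only where a key was missing. Below is a step-by-step sketch of the same reshape that casts to integers along the way. It assumes the sample frame from the question, and the intermediate names `long`, `kv` and `wide` are mine. The explicit `dropna()` is a defensive assumption: rows with fewer pairs are padded with missing cells by the first split, and how `stack` treats those can differ between pandas versions.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame(
    {"var_1": ["a=3:b=4:c=5:d=6:e=3",
               "b=3:a=4:c=5:d=6:e=3",
               "e=3:a=4:c=5:d=6",
               "c=3:a=4:b=5:d=6:f=3"]},
    index=pd.Index([1, 2, 3, 4], name="index"),
)

# Long format: one "key=value" string per row, padding cells dropped.
long = df["var_1"].str.split(":", expand=True).stack().dropna()

# Column 0 holds the key, column 1 the value.
kv = long.str.split("=", expand=True).reset_index(level=1, drop=True)
kv[1] = kv[1].astype(int)

# Move the key into the index, then spread it back out into columns.
wide = (kv.set_index(0, append=True)[1]
          .unstack(fill_value=0)
          .rename_axis(None, axis=1))
print(wide)
```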
Upvotes: 5