Reputation: 2085
I have a data frame like below:
df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]:
var1 var2 var3
0 0,3788,99,20.88 0,929,92,299.90 8,9332,99,29.10
1 3,99022,08,91.995 1,38333,9,993.11 7,922111,07,45.443
I want to split each column on the comma and place the new sets of columns next to each other. The resulting data frame should look like this:
df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})
Out[249]:
var1 var2 var3
x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4
0 0 3788 99 20.880 0 929 92 299.90 8 9332 99 29.100
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
The MultiIndex is not mandatory, but then I'd like to be able to easily gather the data and obtain df3 if needed:
var x1 x2 x3 x4
0 var1 0 3788 99 20.880
1 var1 3 99022 08 91.995
0 var2 0 929 92 299.900
1 var2 1 38333 9 993.110
0 var3 8 9332 99 29.100
1 var3 7 922111 07 45.443
My attempt used pd.melt and str.split:
df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
.sort_values(['index', 'var']) \
.set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]
But:
1) I don't know how to then spread the data for the different variables (var1, var2, var3, ...) next to each other.
2) Transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient, and I care about performance in the solution I'm looking for.
So what's the best way to transform df into df2, so that we can then easily obtain df3 if needed?
Upvotes: 3
Views: 95
Reputation: 1054
Here is an approach that gets df3 first:
df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])
print(df3)
x0 x1 x2 x3 var
0 0 3788 99 20.88 var1
1 3 99022 08 91.995 var1
0 0 929 92 299.90 var2
1 1 38333 9 993.11 var2
0 8 9332 99 29.10 var3
1 7 922111 07 45.443 var3
And then:
df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)
print(df2)
var var1 var2 var3
x0 x1 x2 x3 x0 x1 x2 x3 x0 x1 x2 x3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
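The question labels the pieces x1 to x4, while add_prefix("x") yields x0 to x3. A minimal sketch of the same concat approach with 1-based labels follows; the loop variable names (parts, pieces) are illustrative only, and the values remain strings, as in the output above.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

parts = []
for name in df.columns:
    # Split one column into its comma-separated pieces (still strings)
    pieces = df[name].str.split(',', expand=True)
    # Label the pieces x1..xN instead of the default 0..N-1
    pieces.columns = [f'x{i + 1}' for i in range(pieces.shape[1])]
    # Record which source column the pieces came from
    pieces.insert(0, 'var', name)
    parts.append(pieces)

# Long format: one block of rows per source column
df3 = pd.concat(parts)

# Wide format with (var, x) MultiIndex columns
df2 = (df3.set_index('var', append=True)
          .unstack()
          .swaplevel(axis=1)
          .sort_index(axis=1))
```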
Upvotes: 1
Reputation: 75080
You can use stack, str.split() with expand=True, and unstack() to achieve this:
final=(df.stack().str.split(',',expand=True).unstack().swaplevel(axis=1)
.sort_index(level=0,axis=1))
print(final)
var1 var2 var3
0 1 2 3 0 1 2 3 0 1 2 3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
To rename the inner (second) level of the columns, use:
final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])
var1 var2 var3
x0 x1 x2 x3 x0 x1 x2 x3 x0 x1 x2 x3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
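Note that str.split leaves every value as a string, whereas the desired df2 in the question mixes numbers with strings such as '08'. One possible way to convert, sketched under the assumption that only the third piece (x2 here) must keep its leading zeros, is to apply pd.to_numeric column by column; the name converted is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

final = (df.stack().str.split(',', expand=True).unstack()
           .swaplevel(axis=1).sort_index(level=0, axis=1))
final.columns = pd.MultiIndex.from_tuples(
    [(a, f'x{b}') for a, b in final.columns])

# Work on a copy so the all-string frame stays available
converted = final.copy()
for col in converted.columns:
    # Assumption: 'x2' holds codes like '08' that should stay as strings
    if col[1] != 'x2':
        converted[col] = pd.to_numeric(converted[col])
```

If the leading zeros do not matter, the condition can be dropped and every column converted.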
You can also use the following for the second output shown in your question (here the rows are ordered by original row rather than by var):
df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)
level_1 x0 x1 x2 x3
0 var1 0 3788 99 20.88
1 var2 0 929 92 299.90
2 var3 8 9332 99 29.10
3 var1 3 99022 08 91.995
4 var2 1 38333 9 993.11
5 var3 7 922111 07 45.443
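If the row order of df3 in the question is needed (all var1 rows first, with the original row labels kept), a stable sort on the variable name is one option. This is a sketch only; the level names 'row' and 'var' are assigned here for readability and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

long = df.stack().str.split(',', expand=True).add_prefix('x')
# Name the two index levels so they can be referred to explicitly
long.index.names = ['row', 'var']

# Move 'var' into a column, then group rows by variable;
# a stable sort keeps the original row order within each variable
df3 = long.reset_index('var').sort_values('var', kind='stable')
```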
Upvotes: 1