Reputation: 159
I have a pandas data frame formatted as following:
step value0 value1_avg ... value1_label1 value2_label1 ... Id
0 0 1.551959 0.579051 ... 0.45 0.700000 ... 1
1 1 1.515447 0.580128 ... 0.00 1.000000 ... 1
2 2 1.481759 0.277778 ... 0.00 1.000000 ... 1
3 0 1.461281 0.277778 ... 0.00 1.000000 ... 2
4 1 1.442006 0.439286 ... 0.20 0.733333 ... 2
5 2 1.415662 0.505467 ... 0.10 0.766667 ... 2
6 0 1.394186 0.515361 ... 0.00 1.000000 ... 3
7 1 1.374112 0.605263 ... 0.00 1.000000 ... 3
8 2 1.350135 0.803030 ... 0.00 1.000000 ... 3
I need to reformat it like this:
Value1 ... Value1 Value2 ...
step value0 avg ... label1 label1 ... Id
0 0 1.551959 0.579051 ... 0.45 0.700000 ... 1
1 1 1.515447 0.580128 ... 0.00 1.000000 ... 1
2 2 1.481759 0.277778 ... 0.00 1.000000 ... 1
3 0 1.461281 0.277778 ... 0.00 1.000000 ... 2
4 1 1.442006 0.439286 ... 0.20 0.733333 ... 2
5 2 1.415662 0.505467 ... 0.10 0.766667 ... 2
6 0 1.394186 0.515361 ... 0.00 1.000000 ... 3
7 1 1.374112 0.605263 ... 0.00 1.000000 ... 3
8 2 1.350135 0.803030 ... 0.00 1.000000 ... 3
Unfortunately, I am a bit lost with pandas multi-indexing system. Could someone provides me some advice to reformat the data frame as needed?
Thanks a lot.
Upvotes: 2
Views: 36
Reputation: 862511
I think here is best 'sent'
all columns with no _
to MultiIndex in index
, so if use str.split
get nice MultiIndex in columns
:
df = df.set_index([c for c in df.columns if '_' not in c])
df.columns = df.columns.str.split('_', expand=True)
print (df)
value1 value2
avg label1 label1
step value0 Id
0 1.551959 1 0.579051 0.45 0.700000
1 1.515447 1 0.580128 0.00 1.000000
2 1.481759 1 0.277778 0.00 1.000000
0 1.461281 2 0.277778 0.00 1.000000
1 1.442006 2 0.439286 0.20 0.733333
2 1.415662 2 0.505467 0.10 0.766667
0 1.394186 3 0.515361 0.00 1.000000
1 1.374112 3 0.605263 0.00 1.000000
2 1.350135 3 0.803030 0.00 1.000000
Because if use only split
get missing values for non _
columns names:
df.columns = df.columns.str.split('_', expand=True)
print (df)
step value0 value1 value2 Id
NaN NaN avg label1 label1 NaN
0 0 1.551959 0.579051 0.45 0.700000 1
1 1 1.515447 0.580128 0.00 1.000000 1
2 2 1.481759 0.277778 0.00 1.000000 1
3 0 1.461281 0.277778 0.00 1.000000 2
4 1 1.442006 0.439286 0.20 0.733333 2
5 2 1.415662 0.505467 0.10 0.766667 2
6 0 1.394186 0.515361 0.00 1.000000 3
7 1 1.374112 0.605263 0.00 1.000000 3
8 2 1.350135 0.803030 0.00 1.000000 3
But it is possible with set empty strings to first values of level, if not _
:
L = [x.split('_') if '_' in x else ['', x] for x in df.columns]
df.columns = pd.MultiIndex.from_tuples(L)
print (df)
value1 value2
step value0 avg label1 label1 Id
0 0 1.551959 0.579051 0.45 0.700000 1
1 1 1.515447 0.580128 0.00 1.000000 1
2 2 1.481759 0.277778 0.00 1.000000 1
3 0 1.461281 0.277778 0.00 1.000000 2
4 1 1.442006 0.439286 0.20 0.733333 2
5 2 1.415662 0.505467 0.10 0.766667 2
6 0 1.394186 0.515361 0.00 1.000000 3
7 1 1.374112 0.605263 0.00 1.000000 3
8 2 1.350135 0.803030 0.00 1.000000 3
But then selecting columns with tuples is possible also with empty strings (but it is only trick, the best is use first solution):
print (df[('','step')])
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
Name: (, step), dtype: int64
Upvotes: 2