re-indexing a pandas dataframe from the column names

Question

I have a pandas data frame formatted as following:

    step    value0     value1_avg  ...   value1_label1   value2_label1  ...  Id
0      0  1.551959       0.579051  ...            0.45        0.700000  ...   1
1      1  1.515447       0.580128  ...            0.00        1.000000  ...   1
2      2  1.481759       0.277778  ...            0.00        1.000000  ...   1
3      0  1.461281       0.277778  ...            0.00        1.000000  ...   2
4      1  1.442006       0.439286  ...            0.20        0.733333  ...   2
5      2  1.415662       0.505467  ...            0.10        0.766667  ...   2
6      0  1.394186       0.515361  ...            0.00        1.000000  ...   3
7      1  1.374112       0.605263  ...            0.00        1.000000  ...   3
8      2  1.350135       0.803030  ...            0.00        1.000000  ...   3

I need to reformat it like this:

                           Value1  ...          Value1          Value2  ...  
    step    value0            avg  ...          label1          label1  ...  Id
0      0  1.551959       0.579051  ...            0.45        0.700000  ...   1
1      1  1.515447       0.580128  ...            0.00        1.000000  ...   1
2      2  1.481759       0.277778  ...            0.00        1.000000  ...   1
3      0  1.461281       0.277778  ...            0.00        1.000000  ...   2
4      1  1.442006       0.439286  ...            0.20        0.733333  ...   2
5      2  1.415662       0.505467  ...            0.10        0.766667  ...   2
6      0  1.394186       0.515361  ...            0.00        1.000000  ...   3
7      1  1.374112       0.605263  ...            0.00        1.000000  ...   3
8      2  1.350135       0.803030  ...            0.00        1.000000  ...   3

Unfortunately, I am a bit lost with pandas multi-indexing system. Could someone provides me some advice to reformat the data frame as needed?

Thanks a lot.

jezrael · Accepted Answer

I think here is best 'sent' all columns with no _ to MultiIndex in index, so if use str.split get nice MultiIndex in columns:

df = df.set_index([c for c in df.columns if '_' not in c])
df.columns = df.columns.str.split('_', expand=True)
print (df)
                    value1           value2
                       avg label1    label1
step value0   Id                           
0    1.551959 1   0.579051   0.45  0.700000
1    1.515447 1   0.580128   0.00  1.000000
2    1.481759 1   0.277778   0.00  1.000000
0    1.461281 2   0.277778   0.00  1.000000
1    1.442006 2   0.439286   0.20  0.733333
2    1.415662 2   0.505467   0.10  0.766667
0    1.394186 3   0.515361   0.00  1.000000
1    1.374112 3   0.605263   0.00  1.000000
2    1.350135 3   0.803030   0.00  1.000000

Because if use only split get missing values for non _ columns names:

df.columns = df.columns.str.split('_', expand=True)
print (df)
  step    value0    value1           value2  Id
   NaN       NaN       avg label1    label1 NaN
0    0  1.551959  0.579051   0.45  0.700000   1
1    1  1.515447  0.580128   0.00  1.000000   1
2    2  1.481759  0.277778   0.00  1.000000   1
3    0  1.461281  0.277778   0.00  1.000000   2
4    1  1.442006  0.439286   0.20  0.733333   2
5    2  1.415662  0.505467   0.10  0.766667   2
6    0  1.394186  0.515361   0.00  1.000000   3
7    1  1.374112  0.605263   0.00  1.000000   3
8    2  1.350135  0.803030   0.00  1.000000   3

But it is possible with set empty strings to first values of level, if not _:

L = [x.split('_') if '_' in x else ['', x] for x in df.columns]
df.columns = pd.MultiIndex.from_tuples(L)
print (df)
                    value1           value2   
  step    value0       avg label1    label1 Id
0    0  1.551959  0.579051   0.45  0.700000  1
1    1  1.515447  0.580128   0.00  1.000000  1
2    2  1.481759  0.277778   0.00  1.000000  1
3    0  1.461281  0.277778   0.00  1.000000  2
4    1  1.442006  0.439286   0.20  0.733333  2
5    2  1.415662  0.505467   0.10  0.766667  2
6    0  1.394186  0.515361   0.00  1.000000  3
7    1  1.374112  0.605263   0.00  1.000000  3
8    2  1.350135  0.803030   0.00  1.000000  3

But then selecting columns with tuples is possible also with empty strings (but it is only trick, the best is use first solution):

print (df[('','step')])
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
Name: (, step), dtype: int64

re-indexing a pandas dataframe from the column names

Answers (1)

Related Questions