Gambit1614
Gambit1614

Reputation: 8811

Split columns into MultiIndex with missing columns in pandas

This is similar to the problem I asked here. However, I found out that the data I am working is not always consistent. For, example say :

import pandas as pd

df = pd.DataFrame(pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=["X_a","Y_c","X_b","Y_a"]))

   X_a  Y_c  X_b  Y_a
0    1    2    3    4
1    5    6    7    8
2    9   10   11   12

Now you can see that X does not have corresponding c column and Y does not have corresponding b column. Now when I want to create the multi-level index, I want the dataframe to look like this:

     X             Y
     a    b   c    a    b   c
0    1    3   -1   4   -1   2
1    5    7   -1   8   -1   6
2    9   11   -1  12   -1  10

So as you can see, I want the split in such a way that all upper level columns should have the same lower level columns. Since, the dataset is positve, I am thinking of filling the missing columns with -1, although I am open for suggestions on this. The closest thing I found to my problem was this answer. However, I cannot make it to somehow work with MultiLevel Index like in my previous question. Any help is appreciated.

Upvotes: 13

Views: 5090

Answers (1)

cs95
cs95

Reputation: 402852

Create a MultiIndex and set df.columns.

idx = df.columns.str.split('_', expand=True)
idx
MultiIndex(levels=[['X', 'Y'], ['a', 'b', 'c']],
           labels=[[0, 1, 0, 1], [0, 2, 1, 0]])

df.columns = idx

Now, with the existing MultiIndex, create a new index and use that to reindex the original.

idx = pd.MultiIndex.from_product([idx.levels[0], idx.levels[1]])
idx
MultiIndex(levels=[['X', 'Y'], ['a', 'b', 'c']],
       labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

df.reindex(columns=idx, fill_value=-1)
   X          Y       
   a   b  c   a  b   c
0  1   3 -1   4 -1   2
1  5   7 -1   8 -1   6
2  9  11 -1  12 -1  10

Upvotes: 24

Related Questions