Rtut
Rtut

Reputation: 1007

Pandas- split a multiple index dataframe

I have a multiple indexed pandas data frame where I want to separate values by '||' character and include one more layer of index with three new columns 'Connection', 'Val1' 'Val2'.

Will be helpful if someone could give me few hints to do this.

Current example Data Frame:

                              Experiment1                        Experiment2      
    Target      Analyze1_ab               Analyze2_zz            Analyze1_yy        
    XXX_1   Edge2||3.1E-07||-0.5    Edge2||2.1E-06||-0.9     Edge2||6.4E-02||-0.3   
    XXX_4   Edge1||6.4E-12||1.1     Edge1||2.4E-11||9.4      Edge1||1.4E-11||1.4    
    ABC_1   Edge1||3.9E-07||0.7     Edge1||2.9E-07||5.6      Edge1||6.8E-02||0.4    
    ABC_2   Edge2||1.1E-09||-0.5    Edge2||1.2E-09||1.2      Edge2||1.0E-03||-0.5   
    ABC_3   Edge2||4.6E-25||-0.8    Edge2||2.6E-10||1.9      Edge2||5.0E-17||-0.9   
    XXX_2   Edge2||1.7E-07||-0.5    Edge2||5.7E-08||-0.3     Edge2||4.1E-02||-0.3   
    ABC_4   Edge1||8.1E-02||0.5     Edge1||9.1E-02||1.5      Edge1||5.4E-02||0.6    
    ABC_5   Edge1||6.7E-02||0.3     Edge1||4.2E-02||1.9      Edge1||5.6E-03||0.4    
    XXX_3   Edge2||3.1E-03||-0.4    Edge1||2.4E-11||1.1      Edge2||2.4E-02||-0.3   

Desired Data Frame:

                                   Experiment1                            Experiment2       
    Target          Analyze1_ab                  Analyze2_zz                Analyze1_yy     
           Connection   Val1    Val2    Connection  Val1    Val2    Connection  Val1    Val2
    XXX_1   Edge2   3.10E-07    -0.5    Edge2   2.10E-06    -0.9    Edge2   6.40E-02    -0.3
    XXX_4   Edge1   6.40E-12    1.1     Edge1   2.40E-11    9.4     Edge1   1.40E-11    1.4
    ABC_1   Edge1   3.90E-07    0.7     Edge1   2.90E-07    5.6     Edge1   6.80E-02    0.4
    ABC_2   Edge2   1.10E-09    -0.5    Edge2   1.20E-09    1.2     Edge2   1.00E-03    -0.5
    ABC_3   Edge2   4.60E-25    -0.8    Edge2   2.60E-10    1.9     Edge2   5.00E-17    -0.9
    XXX_2   Edge2   1.70E-07    -0.5    Edge2   5.70E-08    -0.3    Edge2   4.10E-02    -0.3
    ABC_4   Edge1   8.10E-02    0.5     Edge1   9.10E-02    1.5     Edge1   5.40E-02    0.6
    ABC_5   Edge1   6.70E-02    0.3     Edge1   4.20E-02    1.9     Edge1   5.60E-03    0.4
    XXX_3   Edge2   3.10E-03    -0.4    Edge1   2.40E-11    1.1     Edge2   2.40E-02    -0.3

Upvotes: 4

Views: 2040

Answers (3)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210852

You can do it this way:

In [319]: x = df.iloc[:, 1:]
     ...: x.columns = list('abc')
     ...: x = x.stack().str.extractall('([^\|]+)').unstack([-1, -2])
     ...: x.columns = range(9)
     ...:
     ...: x = df.iloc[:, 0].to_frame('Target').join(x)
     ...:
     ...: x.columns = \
     ...: pd.MultiIndex.from_tuples([('', 'Target', '')]) \
     ...:   .union(pd.MultiIndex.from_tuples(
     ...:             [(t[0][0], t[0][1], t[1])
     ...:              for t in product(df.columns[1:].to_series().tolist(),
     ...:              ['Connection', 'Val1', 'Val2'])
     ...:             ]))
     ...:

Result:

In [320]: x
Out[320]:
         Experiment1                                           Experiment2
  Target Analyze1_ab                Analyze2_zz                Analyze1_yy
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3

Upvotes: 1

tuomastik
tuomastik

Reputation: 4906

import pandas as pd

# Initialize DataFrame
# -----------------------------------------------------------------------------

df = pd.DataFrame({
    'Analyze1_ab': ['Edge2||3.1E-07||-0.5', 'Edge1||6.4E-12||1.1'],
    'Analyze2_zz': ['Edge2||2.1E-06||-0.9', 'Edge1||2.4E-11||9.4'],
    'Analyze1_yy': ['Edge2||6.4E-02||-0.3', 'Edge1||1.4E-11||1.4'],
    'Target': ['XXX_1', 'XXX_4'],})

df.columns = pd.MultiIndex.from_tuples(
    [('Experiment1', 'Analyze1_ab'),
     ('Experiment2', 'Analyze1_yy'),
     ('Experiment1', 'Analyze2_zz'),
     ('Target', '')])

# Split 'Analyses' columns by double pipes ||
# -----------------------------------------------------------------------------

# Initialize final DataFrame
final_df = pd.DataFrame()

for col_name in df.columns:
    if (col_name[1].startswith('Analyze') and
            df[col_name].str.contains('||').all()):
        # Split 'Analysis' by || into new columns
        splitted_analysis = df[col_name].str.split('\|\|', expand=True)
        # The new column names are 0, 1, 2. Let's rename them.
        splitted_analysis.columns = ['Connection', 'Val1', 'Val2']
        # Recreate MultiIndex
        splitted_analysis.columns = pd.MultiIndex.from_tuples(
            [(col_name[0], col_name[1], c) for c in splitted_analysis.columns])
        # Concatenate the new columns to the final_df
        final_df = pd.concat(objs=[final_df, splitted_analysis], axis=1)

# Add 'Target' column in the final_df.
# First, extract it.
target_col = pd.DataFrame(df[('Target', '')])
# Then, increase MultiIndex level of 'Target' from 2 to 3,
# to allow smooth concatenation with the final_df.
target_col.columns = pd.MultiIndex.from_tuples([('Target', '', '')])
final_df = pd.concat([final_df, target_col], axis=1)

Validation: print(final_df):

  Experiment1                Experiment2                Experiment1                Target
  Analyze1_ab                Analyze1_yy                Analyze2_zz                      
   Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2       
0       Edge2  3.1E-07  -0.5       Edge2  6.4E-02  -0.3       Edge2  2.1E-06  -0.9  XXX_1
1       Edge1  6.4E-12   1.1       Edge1  1.4E-11   1.4       Edge1  2.4E-11   9.4  XXX_4

Validation: pprint.pprint([c for c in final_df.columns]):

[('Experiment1', 'Analyze1_ab', 'Connection'),
 ('Experiment1', 'Analyze1_ab', 'Val1'),
 ('Experiment1', 'Analyze1_ab', 'Val2'),
 ('Experiment2', 'Analyze1_yy', 'Connection'),
 ('Experiment2', 'Analyze1_yy', 'Val1'),
 ('Experiment2', 'Analyze1_yy', 'Val2'),
 ('Experiment1', 'Analyze2_zz', 'Connection'),
 ('Experiment1', 'Analyze2_zz', 'Val1'),
 ('Experiment1', 'Analyze2_zz', 'Val2'),
 ('Target', '', '')]

Upvotes: 2

Allen Qin
Allen Qin

Reputation: 19947

Setup

df
Out[2319]: 
      ID           Experiment1                                 Experiment2
  Target           Analyze1_ab           Analyze2_ab           Analyze1_yy
0  XXX_1  Edge2||3.1E-07||-0.5  Edge2||2.1E-06||-0.9  Edge2||6.4E-02||-0.3
1  XXX_4   Edge1||6.4E-12||1.1   Edge1||2.4E-11||9.4   Edge1||1.4E-11||1.4
2  ABC_1   Edge1||3.9E-07||0.7   Edge1||2.9E-07||5.6   Edge1||6.8E-02||0.4
3  ABC_2  Edge2||1.1E-09||-0.5   Edge2||1.2E-09||1.2  Edge2||1.0E-03||-0.5
4  ABC_3  Edge2||4.6E-25||-0.8   Edge2||2.6E-10||1.9  Edge2||5.0E-17||-0.9
5  XXX_2  Edge2||1.7E-07||-0.5  Edge2||5.7E-08||-0.3  Edge2||4.1E-02||-0.3
6  ABC_4   Edge1||8.1E-02||0.5   Edge1||9.1E-02||1.5   Edge1||5.4E-02||0.6
7  ABC_5   Edge1||6.7E-02||0.3   Edge1||4.2E-02||1.9   Edge1||5.6E-03||0.4
8  XXX_3  Edge2||3.1E-03||-0.4   Edge1||2.4E-11||1.1  Edge2||2.4E-02||-0.3

Solution

#split columns by '||' and rebuild a Dataframe with the separated columns
df2 = pd.DataFrame(np.asarray(df.iloc[:,1:].apply(lambda x: x.str.split('\|\|')).values.tolist()).reshape(9,-1))
#set Multilevel columns
df2.columns=pd.MultiIndex.from_tuples([('Experiment1','Analyze1_ab','Connection'),
('Experiment1','Analyze1_ab','Val1'),
('Experiment1','Analyze1_ab','Val2'),
('Experiment1','Analyze2_zz','Connection'),
('Experiment1','Analyze2_zz','Val1'),
('Experiment1','Analyze2_zz','Val2'),
('Experiment2','Analyze1_yy','Connection'),
('Experiment2','Analyze1_yy','Val1'),
('Experiment2','Analyze1_yy','Val2')])
#add Target column
df2.insert(0,'Target',df.iloc[:,0])

Out[2324]: 
  Target Experiment1                                           Experiment2  
         Analyze1_ab                Analyze2_zz                Analyze1_yy   
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2 
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3 
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4 
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4 
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5 
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9 
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3 
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6 
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4 
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3

Upvotes: 4

Related Questions