Cord Kaldemeyer
Cord Kaldemeyer

Reputation: 6897

Pandas: How do I split multiple lists in columns into multiple rows?

I have a pandas DataFrame that looks like the following:

     bus_uid   bus_type    type                      obj_uid  \
0     biomass: DEB31    biomass  output       Simple_139804698384200   
0     biomass: DEB31    biomass   other                        duals   
0     biomass: DEB31    biomass   other                       excess   

                                         datetime  \
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   
0   DatetimeIndex(['2015-01-01 00:00:00',  '2015-01-01 01:00:00',  '2015-01-01 02:00:00', ...   

                                           values  
0   [1.0, 2.0, 3.0, ...  
0   [4.0, 5.0, 6.0, ...  
0   [7.0, 8.0, 9.0, ...

And want to convert it into the following format:

     bus_uid   bus_type    type                          obj_uid  datetime             values
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 00:00:00  1.0
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 01:00:00  2.0
0     biomass: DEB31    biomass  output   Simple_139804698384200  2015-01-01 02:00:00  3.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 00:00:00  4.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 01:00:00  5.0
0     biomass: DEB31    biomass   other                    duals  2015-01-01 02:00:00  6.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 00:00:00  7.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 01:00:00  8.0
0     biomass: DEB31    biomass   other                   excess  2015-01-01 02:00:00  9.0

The columns datetime and values have the same dimension.

I have already asked a similar question here but couldn't manage to apply the solution for my problem with two columns.

What's the best way to convert the DataFrame into the required format?

Upvotes: 4

Views: 1752

Answers (2)

Stefan
Stefan

Reputation: 42875

You could iterate through the rows to extract the Index and Series info from the cells. I don't think that reshaping methods work well when you need to extract info at the same time:

Sample data:

rows = 3
df = pd.DataFrame(data={'bus_uid': list(repeat('biomass: DEB31', rows)), 'type': list(repeat('biomass', 3)), 'id': ['id1', 'id2', 'id3'], 'datetime': list(repeat(pd.DatetimeIndex(start=datetime(2016,1,1), periods=3, freq='D'), rows)), 'values': list(repeat([1,2,3], rows))})

          bus_uid                                           datetime   id  \
0  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id1   
1  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id2   
2  biomass: DEB31  DatetimeIndex(['2016-01-01', '2016-01-02', '20...  id3   

      type     values  
0  biomass  [1, 2, 3]  
1  biomass  [1, 2, 3]  
2  biomass  [1, 2, 3]  

Build new DataFrame as you iterate through the DataFrame rows:

new_df = pd.DataFrame()
for index, cols in df.iterrows():
    extract_df = pd.DataFrame.from_dict({'datetime': cols.ix['datetime'], 'values': cols.ix['values']})
    extract_df = pd.concat([extract_df, cols.drop(['datetime', 'values']).to_frame().T], axis=1).fillna(method='ffill').fillna(method='bfill')
    new_df = pd.concat([new_df, extract_df], ignore_index=True)

to get:

    datetime  values         bus_uid   id     type
0 2016-01-01       1  biomass: DEB31  id1  biomass
1 2016-01-02       2  biomass: DEB31  id1  biomass
2 2016-01-03       3  biomass: DEB31  id1  biomass
3 2016-01-01       1  biomass: DEB31  id2  biomass
4 2016-01-02       2  biomass: DEB31  id2  biomass
5 2016-01-03       3  biomass: DEB31  id2  biomass
6 2016-01-01       1  biomass: DEB31  id3  biomass
7 2016-01-02       2  biomass: DEB31  id3  biomass
8 2016-01-03       3  biomass: DEB31  id3  biomass

Upvotes: 2

jezrael
jezrael

Reputation: 862511

You can extract from columns values and datetime new Series and then merge them with original dataframe df by concat:

s1 = df['values'].apply(pd.Series, 1).stack()
s1.index = s1.index.droplevel(-1) # to line up with df's index
s1.name = 'values' # needs a name to join

s2 = df['datetime'].apply(pd.Series, 1).stack()
s2.index = s2.index.droplevel(-1) # to line up with df's index
s2.name = 'datetime' # needs a name to join

#remove duplicity columns
df = df.drop( ['values', 'datetime'], axis=1)

#concat all together
df= pd.concat([df,s1,s2], axis=1).reset_index(drop=True)
print df

   bus_uid        bus_type            type                 obj_uid values  \
0        0  biomass: DEB31  biomass output  Simple_139804698384200    1.0   
1        0  biomass: DEB31  biomass output  Simple_139804698384200    2.0   
2        0  biomass: DEB31  biomass output  Simple_139804698384200    3.0   
3        0  biomass: DEB31   biomass other                   duals    4.0   
4        0  biomass: DEB31   biomass other                   duals    5.0   
5        0  biomass: DEB31   biomass other                   duals    6.0   
6        0  biomass: DEB31   biomass other                  excess    7.0   
7        0  biomass: DEB31   biomass other                  excess    8.0   
8        0  biomass: DEB31   biomass other                  excess    9.0   

             datetime  
0 2015-01-01 00:00:00  
1 2015-01-01 01:00:00  
2 2015-01-01 02:00:00  
3 2015-01-01 00:00:00  
4 2015-01-01 01:00:00  
5 2015-01-01 02:00:00  
6 2015-01-01 00:00:00  
7 2015-01-01 01:00:00  
8 2015-01-01 02:00:00  

Upvotes: 2

Related Questions