Anita
Anita

Reputation: 285

Expand nested lists to rows, create headers, and map back to original columns

I would like to expand the nested lists to multiple rows and columns. At the same time, map back the results to the corresponding column values.

The dataframe is like the following.

df=pd.DataFrame({
'column_name':['income_level', 'geo_level'],
'results':[[[0, 12, 13], [0, 98, 43], [1, 29, 73], [2, 12, 34]], [[0, 78, 23], [1, 56, 67], [2, 67, 34]]]}) 

column_name  |  results
----------------------
income_level | [[0, 12, 13], [0, 98, 43], [1, 29, 73], [2, 12, 34]]
geo_level    | [[0, 78, 23], [1, 56, 67], [2, 67, 34]]

The final results I'm looking for are like this. (expanding the nested list to rows and columns and matching the corresponding column values)

column_name  | num |pct | index
income_level | 0   | 12 | 13 
income_level | 0   | 98 | 43
income_level | 1   | 29 | 73 
income_level | 2   | 12 | 34
geo_level    | 0   | 78 | 23
geo_level    | 1   | 56 | 67
geo_level    | 2   | 67 | 34

My current code:

pd.DataFrame(list(itertools.chain(*df['results'].values.tolist())), columns=['num', 'pct', 'index'])

I'm able to expand and create header but I cannot match back to corresponding column values (i.e. the column_name)

Upvotes: 1

Views: 884

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 31011

Try the following solution:

Define a "reformatting" function:

def reform(row):
    res = pd.DataFrame(row.results, columns=['num', 'pct', 'index'])
    res.insert(0, 'column_name', row.column_name)
    return res

Then apply it and concatenate results:

lst = df.apply(reform, axis=1).tolist()
pd.concat(lst).reset_index(drop=True)

Upvotes: 0

Andy L.
Andy L.

Reputation: 25269

Explode column results and assign to df1. Create the new dataframe from list of sublist of df1.results and reset_index

df1 = df.explode('results')
pd.DataFrame(df1.results.tolist(), 
             index=df1.column_name,
             columns=['num', 'pct', 'index']).reset_index()

Out[562]:
    column_name  num  pct  index
0  income_level    0   12     13
1  income_level    0   98     43
2  income_level    1   29     73
3  income_level    2   12     34
4     geo_level    0   78     23
5     geo_level    1   56     67
6     geo_level    2   67     34

On pandas < 0.25, use sum, np.repeat, and reset_index to achieve the same thing

pd.DataFrame(df.results.sum(), 
             index=np.repeat(df.column_name, df.results.str.len()), 
             columns=['num', 'pct', 'index']).reset_index()

Out[572]:
    column_name  num  pct  index
0  income_level    0   12     13
1  income_level    0   98     43
2  income_level    1   29     73
3  income_level    2   12     34
4     geo_level    0   78     23
5     geo_level    1   56     67
6     geo_level    2   67     34

Upvotes: 1

Related Questions