More Than Five

Reputation: 10419

Flatten a list of elements in Pandas DataFrame

My data-structure is:

ds = [{
    "name": "groupA",
    "subGroups": [123,456]
},
{
    "name": "groupB",
    "subGroups": ['aaa', 'bbb' , 'ccc']
}]

This gives the following DataFrame:

df = pd.DataFrame(ds)

    name    subGroups
0   groupA  [123, 456]
1   groupB  [aaa, bbb, ccc]   

I want:

    name    subGroupsFlattend
0   groupA  123
1   groupA  456
2   groupB  aaa
3   groupB  bbb
4   groupB  ccc

Any ideas?

Upvotes: 10

Views: 7133

Answers (4)

Jake Reece

Reputation: 1168

Use DataFrame.explode (available since pandas 0.25):

df = df.explode('subGroups')
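
Note that explode keeps the original index label for every element of a list, so the result is indexed 0, 0, 1, 1, 1. A minimal sketch of getting the exact layout from the question, assuming the ds from the question; the variable name flat and the rename to subGroupsFlattend are only there to match the desired output:

```python
import pandas as pd

ds = [
    {"name": "groupA", "subGroups": [123, 456]},
    {"name": "groupB", "subGroups": ["aaa", "bbb", "ccc"]},
]
df = pd.DataFrame(ds)

# explode() emits one row per list element and repeats the source row's
# index label, so reset_index(drop=True) is used to get a fresh 0..n-1 index.
flat = (
    df.explode("subGroups")
      .rename(columns={"subGroups": "subGroupsFlattend"})
      .reset_index(drop=True)
)
print(flat)
```

Drop the rename and reset_index calls if the repeated index and the original column name are fine for you.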

Upvotes: 10

Kliment Merzlyakov

Reputation: 1083

This is YOBEN_S's solution, but much more efficient for big DataFrames: the lists are concatenated with itertools.chain instead of df.subGroups.sum().

from itertools import chain
pd.DataFrame({'name':df.name.repeat(df.subGroups.str.len()),
              'subGroup':list(chain.from_iterable(df.subGroups.to_list()))})
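
For reference, a self-contained sketch of the same idea, assuming the ds from the question. The names lengths and flat are illustrative, and the to_numpy() call is an addition here that drops the repeated index so the result is numbered 0..n-1:

```python
from itertools import chain

import pandas as pd

ds = [
    {"name": "groupA", "subGroups": [123, 456]},
    {"name": "groupB", "subGroups": ["aaa", "bbb", "ccc"]},
]
df = pd.DataFrame(ds)

# Number of elements in each row's list.
lengths = df["subGroups"].str.len()

flat = pd.DataFrame({
    # Repeat each name once per list element; to_numpy() discards the
    # repeated index labels so the new frame gets a default RangeIndex.
    "name": df["name"].repeat(lengths).to_numpy(),
    # chain.from_iterable walks all lists once, without building
    # intermediate lists the way repeated list concatenation does.
    "subGroup": list(chain.from_iterable(df["subGroups"])),
})
print(flat)
```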

Upvotes: 0

jezrael

Reputation: 862741

You can use json_normalize:

import pandas as pd

# pandas.io.json.json_normalize is deprecated since pandas 1.0; use pd.json_normalize
df = pd.json_normalize(ds, ['subGroups'], 'name').rename(columns={0:'subGroupsFlattend'})
print (df)
  subGroupsFlattend    name
0               123  groupA
1               456  groupA
2               aaa  groupB
3               bbb  groupB
4               ccc  groupB

Alternative solution that flattens the dictionaries with a list comprehension:

L = [y for x in ds for y in zip(x["subGroups"], [x["name"]] * len(x["subGroups"]))]
print (L)
[(123, 'groupA'), (456, 'groupA'), ('aaa', 'groupB'), ('bbb', 'groupB'), ('ccc', 'groupB')]

df = pd.DataFrame(L, columns=['subGroupsFlattend','name'])
print (df)
  subGroupsFlattend    name
0               123  groupA
1               456  groupA
2               aaa  groupB
3               bbb  groupB
4               ccc  groupB

EDIT:

from itertools import chain
df = pd.DataFrame(ds)

df1 = pd.DataFrame({
    'subGroups' : list(chain.from_iterable(df['subGroups'].tolist())), 
    'name' : df['name'].values.repeat(df['subGroups'].str.len())
})
print (df1)
     name subGroups
0  groupA       123
1  groupA       456
2  groupB       aaa
3  groupB       bbb
4  groupB       ccc

Upvotes: 3

BENY

Reputation: 323276

You can get the desired output as follows:

pd.DataFrame({'name':df.name.repeat(df.subGroups.str.len()),'subGroup':df.subGroups.sum()})
Out[364]: 
     name subGroup
0  groupA      123
0  groupA      456
1  groupB      aaa
1  groupB      bbb
1  groupB      ccc

Upvotes: 5
