Simon Krannig
Simon Krannig

Reputation: 23

Stacking dataframes in pandas

I am looking for a way to stack a dataframe in pandas in the following manner:

id     val
id01   ['a', 'b']
id02   ['b']
id03   []
id04   ['a','c']

to:

id01   'a'
       'b'
id02   'b'
id04   'a'
       'c'

Upvotes: 2

Views: 135

Answers (2)

jpp
jpp

Reputation: 164673

For a Pandorable solution, @jezrael's solution is good.

If efficiency is a concern, building a new dataframe is often more efficient than stack:

import pandas as pd, numpy as np
from itertools import chain

df = pd.DataFrame({'id': ['id01', 'id02', 'id03', 'id04'],
                   'val': [['a', 'b'], ['b'], [], ['a', 'c']]})

lens = list(map(len, df['val']))

res = pd.DataFrame({'id': np.repeat(df['id'].values, lens),
                    'val': list(chain.from_iterable(df['val']))})

print(res)

#      id val
# 0  id01   a
# 1  id01   b
# 2  id02   b
# 3  id04   a
# 4  id04   c

Performance benchmarking

import pandas as pd, numpy as np
from itertools import chain

df = pd.DataFrame({'id': ['id01', 'id02', 'id03', 'id04'],
                   'val': [['a', 'b'], ['b'], [], ['a', 'c']]})

df = pd.concat([df]*100000)

def jp(df):
    lens = list(map(len, df['val']))

    res = pd.DataFrame({'id': np.repeat(df['id'].values, lens),
                        'val': list(chain.from_iterable(df['val']))})

    return res

def jez(df):
    return (pd.DataFrame(df['val'].values.tolist(), index=df['id'])
              .stack()
              .reset_index(level=1, drop=True)
              .reset_index(name='val'))

%timeit jp(df)   # 137 ms per loop
%timeit jez(df)  # 197 ms per loop

Upvotes: 0

jezrael
jezrael

Reputation: 862681

If want Series with MultiIndex use stack with DataFrame contructor:

s = pd.DataFrame(df['val'].values.tolist(), index=df['id']).stack()
print (s)
id     
id01  0    a
      1    b
id02  0    b
id04  0    a
      1    c
dtype: object

If need DataFrame add double reset_index - first fore remove second level of MultiIndex and second for create column from index:

df = (pd.DataFrame(df['val'].values.tolist(), index=df['id'])
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='val'))
print (df)
     id val
0  id01   a
1  id01   b
2  id02   b
3  id04   a
4  id04   c

Last if need replace duplicated values use loc + duplicated:

df.loc[df['id'].duplicated(), 'id'] = ''
print (df)
     id val
0  id01   a
1         b
2  id02   b
3  id04   a
4         c

Upvotes: 1

Related Questions