Reputation: 23098
I have an initial Pandas DataFrame with three columns, one of which contains a list of strings. The goal is to split each row into as many rows as there are items in its obj column, so that for instance this:
from to obj
--------------------
abc xyz [foo, bar]
def uvw [gee]
ghi rst [foo, bar, baz]
becomes this:
from to obj
--------------------
abc xyz foo
abc xyz bar
def uvw gee
ghi rst foo
ghi rst bar
ghi rst baz
Currently I'm doing it like this:
import pandas as pd

transformed = pd.DataFrame(columns=['from', 'to', 'obj'])
for index, row in origin.iterrows():
    for obj in row['obj']:
        transformed = transformed.append(pd.Series({
            'from': row['from'],
            'to': row['to'],
            'obj': obj
        }), ignore_index=True)
This works perfectly fine, except it's painfully slow. If origin has 100,000 elements, it can easily take up to one hour to compute transformed.
Is there a vectorised way of getting to the same result, without having to resort to Python loops?
Upvotes: 2
Views: 62
Reputation: 164823
In essence, you are repeating the values in the scalar columns and chaining the lists in the obj column, according to the length of each list. So you can use np.repeat and itertools.chain as appropriate. This solution is efficient for a small number of columns, as in your example.
import pandas as pd
import numpy as np
from itertools import chain

# set up dataframe
df = pd.DataFrame({'from': ['abc', 'def', 'ghi'],
                   'to': ['xyz', 'uvw', 'rst'],
                   'obj': [['foo', 'bar'], ['gee'], ['foo', 'bar', 'baz']]})

# calculate the length of each list in obj
lens = df['obj'].map(len)

# calculate result, repeating the scalar columns and chaining the list column
res = pd.DataFrame({'from': np.repeat(df['from'], lens),
                    'to': np.repeat(df['to'], lens),
                    'obj': list(chain.from_iterable(df['obj']))})

print(res)
  from   to  obj
0  abc  xyz  foo
0  abc  xyz  bar
1  def  uvw  gee
2  ghi  rst  foo
2  ghi  rst  bar
2  ghi  rst  baz
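Note that if you are on pandas 0.25 or later, DataFrame.explode achieves the same result directly, handling the index bookkeeping for you. A minimal sketch, assuming the same df as above:

# Sketch, assuming pandas >= 0.25: explode expands each element of the
# lists in 'obj' into its own row, repeating the other columns.
res = df.explode('obj').reset_index(drop=True)
print(res)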
Upvotes: 1