Reputation: 2940
I wrote code that concatenates parts of a DataFrame back onto the same DataFrame, in order to normalize the occurrence of rows according to a certain column.
import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    for tag, group in data.groupby(expectation, sort=False):
        array = pandas.DataFrame(columns=data.columns.values)
        # Append whole copies of the group.
        i = 0
        while i < (max_count // int(counts[tag])):
            array = pandas.concat([array, group])
            i += 1
        # Append a random remainder of rows.
        i = max_count % counts[tag]
        if i > 0:
            array = pandas.concat([array, group.loc[random.sample(list(group.index), i)]])
        data = pandas.concat([data, array])
    return data
and this is unbelievably slow. Is there a way to concatenate DataFrames quickly, without creating copies of the data at every step?
Upvotes: 6
Views: 7973
Reputation: 76376
There are a couple of things that stand out.
To begin with, the loop
i = 0
while i < (max_count // int(counts[tag])):
    array = pandas.concat([array, group])
    i += 1
is going to be very slow. Pandas is not built for these dynamic concatenations, and I suspect the performance is quadratic for what you're doing.
Instead, perhaps you could try
pandas.concat([group] * (max_count // int(counts[tag])))
which creates the full list first, and then calls concat once on the entire list. This should make the complexity linear, and I suspect it will have lower constants in any case.
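To see the difference concretely, here is a toy comparison (with arbitrary made-up sizes): the loop version copies the accumulated frame on every iteration, while the one-shot version copies each block exactly once.

import pandas

df = pandas.DataFrame({"x": range(100)})

# Quadratic: each iteration copies everything accumulated so far.
out = pandas.DataFrame(columns=df.columns)
for _ in range(500):
    out = pandas.concat([out, df])

# Linear: build the list first, then concatenate once.
out = pandas.concat([df] * 500)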
Another thing that would reduce these small concats is using groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function and call apply on it; Pandas will figure out how best to concat all of the results into a single DataFrame.
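As a minimal sketch of that idea, assuming the same data and expectation as in your question, and assuming the goal is to bring every group up to max_count rows (oversample is a hypothetical name for the extracted loop body):

import random
import pandas

max_count = int(data[expectation].value_counts().max())

def oversample(group):
    # Whole copies of the group, plus a random remainder, to reach max_count rows.
    whole, remainder = divmod(max_count, len(group))
    parts = [group] * whole
    if remainder:
        parts.append(group.loc[random.sample(list(group.index), remainder)])
    return pandas.concat(parts)

balanced = data.groupby(expectation, sort=False).apply(oversample)

Note that apply stacks the per-group results for you, so no explicit concat across groups is needed.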
However, if you prefer to keep the loop, I'd just append everything to a list and concat it all at the end:
stuff = []
for tag, group in data.groupby(expectation, sort=False):
    # Call stuff.append for any DataFrame you were going to concat.
    ...
pandas.concat(stuff)
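Applied to your normalize, that pattern might look something like this; a sketch that keeps your original logic but defers all concatenation to a single call:

import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    stuff = [data]
    for tag, group in data.groupby(expectation, sort=False):
        # Whole copies of the group, then a random remainder of rows.
        stuff.extend([group] * (max_count // int(counts[tag])))
        remainder = max_count % int(counts[tag])
        if remainder > 0:
            stuff.append(group.loc[random.sample(list(group.index), remainder)])
    return pandas.concat(stuff)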
Upvotes: 5