Jernej

Reputation: 352

Mitigating a performance warning from pandas (DataFrame is highly fragmented)

Suppose we have a function bar(df) that returns a numpy array of length len(df) given a dataframe df.
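(For concreteness, a minimal stand-in along the lines below, mirroring the setup used in the answers, is assumed throughout; the real bar and the value of N are not shown in the question.)

import numpy as np
import pandas as pd

N = 5000
df = pd.DataFrame(index=range(100))

def bar(df):
    # stand-in only: one random value per row of df
    return np.random.rand(len(df))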

Now consider the idiom

def foo(df):
    for i in range(N):
        df['FOO_' + str(i)] = bar(df)
    return df

A recent pandas update started causing the following warning:

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

As far as I understand, a way to mitigate this is to change the above code to the following idiom:

def foo2(df):
    frames = [df]
    for i in range(N):
        frames += [pd.Series(bar(df), index=df.index)]
    return pd.concat(frames, axis=1)

The above code fixes the warning, but results in much worse execution times.

In [110]: %timeit foo(df)
1.73 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [111]: %timeit foo2(df)
2.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A fix for a performance warning that introduces extra overhead of its own seems silly. Therefore, my question is:

How can I fix the warning while also obtaining better performance? In other words, is there a way to improve the function foo2 so that it performs better than foo?

Upvotes: 14

Views: 9498

Answers (2)

Jonas Hörsch

Reputation: 535

A very efficient and elegant way to build up a dataframe column-wise from numpy arrays is to collect the columns in a dictionary and then convert them to the desired dataframe in a single step. That way the cost of consolidating the data and managing the index is paid only once.

If I extend @SultanOrazbayev's example with a foo2_dict variant:

def foo2_dict(df):
    new_columns = pd.DataFrame({f"FOO_{i}": bar(df) for i in range(N)}, index=df.index)
    return pd.concat([df, new_columns], axis=1)

I see an additional factor-of-6 improvement from foo2_opt (390 ms) to foo2_dict (58 ms), but this is of course highly dependent on the actual implementation of the underlying bar function.

Note also that the speed improvement from the dictionary comprehension itself is secondary to doing the numpy -> pandas conversion in one go; e.g. a foo2_incremental_dict variant takes 59 ms for me:

def foo2_incremental_dict(df):
    new_columns = {}
    for i in range(N):
        new_columns[f"FOO_{i}"] = bar(df)
    new_columns = pd.DataFrame(new_columns, index=df.index)

    return pd.concat([df, new_columns], axis=1)
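A hedged sketch of how these timings could be reproduced (exact numbers will vary with the machine and with the real bar implementation; it assumes df, N, bar and the foo2_* variants from this thread are already defined):

import timeit

for fn in (foo2_opt, foo2_dict, foo2_incremental_dict):
    # average over a few calls to smooth out noise
    per_call = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{fn.__name__}: {per_call * 1000:.0f} ms per call")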

Upvotes: 6

SultanOrazbayev

Reputation: 16561

One opportunity to optimize the code is to rewrite the += loop as a list comprehension:

import numpy as np
import pandas as pd

N = 5000
df = pd.DataFrame(index=[_ for _ in range(100)])


def bar(df):
    return np.random.rand(len(df))


def foo2_orig(df):
    frames = [df]
    for i in range(N):
        frames += [pd.Series(bar(df), index=df.index)]
    return pd.concat(frames, axis=1)


def foo2_opt(df):
    frames = pd.concat([pd.Series(bar(df), index=df.index) for i in range(N)], axis=1)
    return pd.concat([df, frames], axis=1)

On my machine I see a factor-of-2 performance improvement, although I'm not sure whether 100 rows and 5000 columns are realistic for your situation. The speed-up comes from building the list of series with a comprehension instead of growing it with += inside the loop.

Update:

If the list of column names is known (list_col_names), then the individual series can be given custom column names via the name kwarg:

def foo2_opt(df):
    frames = pd.concat(
        [pd.Series(bar(df), index=df.index, name=col_name)
         for i, col_name in zip(range(N), list_col_names)],
        axis=1,
    )
    return pd.concat([df, frames], axis=1)
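For illustration, list_col_names is assumed here to be any sequence of N names, e.g.:

# hypothetical column names, one per generated series
list_col_names = [f"FOO_{i}" for i in range(N)]

result = foo2_opt(df)
print(result.columns[:3].tolist())  # names taken from the series, e.g. FOO_0, FOO_1, FOO_2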

Upvotes: 2
