Reputation: 571
I got the following warning

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()
when I tried to append multiple dataframes like this:

df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file  # <---- this line causes the warning
    df1 = df1.append(df, ignore_index=True)
I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions that avoid the issue.
I tried to create a test script to reproduce the problem, but I don't see the PerformanceWarning with a testing dataset (random integers). The same code keeps producing the warning when reading the real dataset, so it looks like something in the real dataset triggers the issue.
import pandas as pd
import numpy as np
import os
import glob

rows = 35000
cols = 1900

def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        pd.DataFrame(
            np.random.randint(1, 1_000, (rows, cols))
        ).to_pickle(file)
        files.append(file)
    return files
# Comment out the first line to run the real dataset; comment out the second line to run the testing dataset
files = gen_data(rows, cols, 10)                   # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')  # real dataset, produces the performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)
dfs = pd.concat(dfs, ignore_index=True)
Upvotes: 47
Views: 137258
Reputation: 1
Consider changing the data type (dtype) in your DataFrame. This solved my issue, because the error message was misleading.
I once had massive performance issues with a DataFrame, resulting in the IDE terminating, because I unknowingly used int64 instead of string/object in one column that was used for multiple lookups. The function ran for 30 minutes and then broke everything, even though I knew this had worked before in a very similar version. The only difference I spotted was the dtype of one of the columns of the DataFrame.
After I changed that column's dtype from integer to string (object), the program finished within 1 minute:
df['column_name'] = df['column_name'].astype(str)
While the warning message said the data was fragmented, the actual problem was a completely different one.
Upvotes: 0
Reputation: 1291
Timing the proposed solutions confirms what other responses have indicated: disregarding the warning is likely the quickest option in terms of computational efficiency.
df_0.join(ps_1)
2.37 ms ± 116 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
pd.merge(df_0, ps_1, left_index=True, right_index=True)
2.28 ms ± 82.2 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 pd.concat([df_0, ps_1], axis=1)
2.23 ms ± 107 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0[ps_1.name] = 1
18 μs ± 1.2 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.loc[:, ps_1.name] = 1
21.1 μs ± 881 ns per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: 1})
124 μs ± 12.7 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: ps_1})
216 μs ± 73.6 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.copy()
207 μs ± 34.1 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
import numpy as np
import pandas as pd
x = np.arange(100000)
df_0 = pd.DataFrame({f'col_{i}': x for i in range(50)})
ps_1 = pd.Series(1, index=df_0.index, name='col_new')
print('df_0.join(ps_1)')
%timeit -n 100 -r 10 df_0.join(ps_1)
print('\npd.merge(df_0, ps_1, left_index=True, right_index=True)')
%timeit -n 100 -r 10 pd.merge(df_0, ps_1, left_index=True, right_index=True)
print('\n%timeit -n 100 -r 10 pd.concat([df_0, ps_1], axis=1)')
%timeit -n 100 -r 10 pd.concat([df_0, ps_1], axis=1)
print('\n%timeit -n 100 -r 10 df_0[ps_1.name] = 1')
%timeit -n 100 -r 10 df_0[ps_1.name] = 1
df_0 = pd.DataFrame({f'col_{i}': x for i in range(5)})
print('\n%timeit -n 100 -r 10 df_0.loc[:, ps_1.name] = 1')
%timeit -n 100 -r 10 df_0.loc[:, ps_1.name] = 1
df_0 = pd.DataFrame({f'col_{i}': x for i in range(5)})
print('\n%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: 1})')
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: 1})
print('\n%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: ps_1})')
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: ps_1})
print('\n%timeit -n 100 -r 10 df_0.copy()')
%timeit -n 100 -r 10 df_0.copy()
Upvotes: 1
Reputation: 1483
Aware that this might be a reply that some will find highly controversial, I'm still posting my opinion here...
Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it's the user's responsibility to fix it, not the module's responsibility to propose code refactoring steps.
This can be done as follows (kudos to @daydaybroskii for the comment below)
import pandas as pd
from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
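If silencing the warning globally is too broad, a scoped variant is possible with the standard library's warnings module alone. This is a minimal sketch of that idea (the loop is just a stand-in to trigger the warning):

import warnings
import pandas as pd

# Suppress PerformanceWarning only inside this block; the previous
# warning filters are restored when the context manager exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
    df = pd.DataFrame(index=range(10))
    for i in range(150):        # enough single-column inserts to normally trigger the warning
        df[f"col_{i}"] = i      # no warning emitted here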
Rationale for this harsh reply:
I am seeing this warning at many different places now that I have migrated to pandas v2.0.0. The reason is that, at multiple places in the script, I remove and add records from dataframes, using many calls to .loc[] and .concat().
Now, given that I am pretty savvy in vectorization, we perform these operations with performance in mind (e.g., never inside a for loop, but maybe ripping out an entire block of records, such as overwriting some "inner 20%" of the dataframe after multiple pd.merge() operations - think of it as ETL operations on a database implemented in pandas instead of SQL). We see that the application runs incredibly fast, even though some dataframes contain ~4.5 mn records. More specifically: for one script, I get >50 of these warnings logged in <0.3 seconds, which I, subjectively, don't perceive as particularly "poor performance" (running a serial application with PyCharm in 'debugging' mode - so not exactly a setup in which you would expect the best performance in the first place).
So, I conclude:
- The same code ran on pandas <2.0.0 and never raised a warning.
- We don't call insert() across the entire ecosystem - the fragmentation that we do have in our dataframes comes from many iterative, but fast, updates - so thanks for sending us down the wrong path.
- We will certainly not refactor code that is showing excellent performance, and has been tested and validated over and over again, just because someone from the pandas team wants to educate us about stuff we know :/ If at least the performance were poor, I would welcome this message as a suggestion for improvement (even then: not a warning, but an 'info') - but given its current indiscriminate way of popping up: for once, it's actually the module that's the problem, not the user.

Edit: This is 100% the same issue as the warning PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance. - which, despite warning me about "performance", pops up 28 times (!) in less than 3 seconds - again, in debugging mode of PyCharm. I'm pretty sure removing the warning alone would improve performance by 20% (or 20 ms per operation ;)). That warning, too, started bothering us as of pandas v2.0.0 and should be removed from the module altogether.
Upvotes: 40
Reputation: 9
In case some of the dataframes to be concatenated share some common columns, something like this might do the trick:
def update_df(df: pd.DataFrame, new_df: pd.DataFrame) -> pd.DataFrame:
    cols_to_update = new_df.columns.intersection(df.columns)
    cols_to_add = new_df.columns.difference(df.columns)
    df[cols_to_update] = new_df[cols_to_update]
    return pd.concat([df, new_df[cols_to_add]], axis=1)
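A minimal usage sketch of the function above (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
new = pd.DataFrame({'b': [30, 40], 'c': [5, 6]})   # shares column 'b' with df

merged = update_df(df, new)
print(merged)
#    a   b  c
# 0  1  30  5
# 1  2  40  6

Shared column 'b' is overwritten in place, and only the genuinely new column 'c' goes through pd.concat.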
Upvotes: 0
Reputation: 71
I've checked the pandas source code (source) and the PerformanceWarning is quite simple: once more than 100 columns are created one by one, without specifying a dtype or in a different fashion than pd.concat, the warning is always shown. A simple example (creating 1001 columns, each with 10k observations):
import pandas as pd

def V1w():
    X = pd.DataFrame()
    for i in range(1001):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X.loc[:, f'X{i}'] = y
    return X

def V1nw():
    X = pd.DataFrame()
    for i in range(1001):
        y = pd.DataFrame({f'X{i}': [i] * 10000}, dtype='Int64')
        X.loc[:, f'X{i}'] = y
    return X

def V2():
    X = []
    for i in range(1001):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X.append(y)
    X = pd.concat(X, axis=1)
    return X
Even though V1w performs, in practical terms, the same as V2:
V1w: 1.63 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V1nw: 798 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V2: 1.37 s ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
yet V1w will produce a PerformanceWarning for every column added beyond the 100th, while V2 will not.
Additionally, specifying the dtype (as in V1nw) suppresses the warning and, more importantly, roughly doubles the performance of the operation.
In summary, there's no compelling reason to display this warning, as it seems to be a legacy artifact from earlier versions of pandas. Its simplistic logic doesn't offer meaningful insights into your data or operations.
Solution? Well:
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
Upvotes: 7
Reputation: 23121
Assigning more than 100 non-extension dtype new columns causes this warning (source code).1 For example, the following reproduces it:
df = pd.DataFrame(index=range(5))
df[[f"col{x}" for x in range(101)]] = range(101) # <---- PerformanceWarning
Using an extension dtype silences the warning.
df = pd.DataFrame(index=range(5))
df[[f"col{x}" for x in range(101)]] = pd.DataFrame([range(101)], index=df.index, dtype='Int64') # <---- no warning
However, in most cases, pd.concat() as suggested by the warning is a better solution. For the case above, that would be as follows.
df = pd.DataFrame(index=range(5))
df = pd.concat([
    df,
    pd.DataFrame([range(101)], columns=[f"col{x}" for x in range(101)], index=df.index)
], axis=1)
For the example in the OP, the following would silence the warning (because assign creates a copy).
dfs = pd.concat([pd.read_pickle(file).assign(id=file) for file in files], ignore_index=True)
1: New column assignment is done via the __setitem__() method, which calls the insert() method of the BlockManager object (the internal data structure that holds pandas dataframes). That's why the warning says insert is being called repeatedly.
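If you need to find out which assignment is triggering the repeated insert calls, one option (a sketch based on the threshold described above, not something the warning itself suggests) is to escalate the warning to an exception so you get a full traceback:

import warnings
import pandas as pd

# Turn PerformanceWarning into an exception so the traceback points
# at the exact column assignment that crossed the block threshold.
warnings.filterwarnings("error", category=pd.errors.PerformanceWarning)

df = pd.DataFrame(index=range(5))
try:
    for i in range(150):
        df[f"col{i}"] = i                      # eventually raises instead of warning
except pd.errors.PerformanceWarning as exc:
    print(f"fragmentation reported at col{i}: {exc}")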
Upvotes: 4
Reputation: 315
I had the same problem. This raised the PerformanceWarning:
df['col1'] = False
df['col2'] = 0
df['col3'] = 'foo'
This didn't:
df[['col1', 'col2', 'col3']] = (False, 0, 'foo')
This doesn't raise the warning either, but doesn't do anything about the underlying issue.
df.loc[:, 'col1'] = False
df.loc[:, 'col2'] = 0
df.loc[:, 'col3'] = 'foo'
Maybe you're adding single columns elsewhere?
copy() is supposed to consolidate the dataframe and thus defragment it. There was a bug fix for this in pandas 1.3.1 (GH 42579: https://github.com/pandas-dev/pandas/pull/42579). Copies of a larger dataframe might get expensive.
Tested on pandas 1.5.2, Python 3.8.15.
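For completeness, here is a small sketch of the consolidation that copy() performs. It peeks at df._mgr.nblocks, an internal attribute that may change between pandas versions, so treat it as illustrative only:

import warnings
import pandas as pd

warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)

df = pd.DataFrame(index=range(1000))
for i in range(150):
    df[f"col{i}"] = i            # each single-column assignment adds another internal block

print(df._mgr.nblocks)           # many blocks: the frame is fragmented
df = df.copy()                   # copy() consolidates the blocks
print(df._mgr.nblocks)           # should drop to 1 for same-dtype columns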
Upvotes: 8
Reputation: 1875
This is a problem with a recent update. Check this issue from pandas-dev. It seems to be resolved in pandas version 1.3.1 (reference PR).
Upvotes: 2
Reputation: 1208
append is not an efficient method for this operation. concat is more appropriate in this situation.
Replace
df1 = df1.append(df, ignore_index=True)
with
df1 = pd.concat((df1, df), axis=0, ignore_index=True)
Details about the differences are in this question: Pandas DataFrame concat vs append
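Note that calling pd.concat inside the loop still copies the growing frame on every iteration. A common pattern (a sketch along the lines of the question's code, assuming the files are pickles as in its test script) is to collect the pieces in a list and concatenate once at the end:

import pandas as pd

frames = []
for file in files:                           # `files` as defined in the question
    df = pd.read_pickle(file)
    df['id'] = file
    frames.append(df)

df1 = pd.concat(frames, ignore_index=True)   # single concat instead of repeated append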
Upvotes: 15