Reputation: 571
I got the following warning

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use newframe = frame.copy()
when I tried to append multiple dataframes like this:

df1 = pd.DataFrame()
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file  # <---- this line causes the warning
    df1 = df1.append(df, ignore_index=True)
I wonder if anyone can explain how copy() can avoid or reduce the fragmentation problem, or suggest other solutions that avoid the issue.
I tried to create a test script to reproduce the problem, but I don't see the PerformanceWarning with a testing dataset (random integers). The same code keeps producing the warning when reading the real dataset, so it looks like something in the real dataset triggers the issue.
import pandas as pd
import numpy as np
import os
import glob

rows = 35000
cols = 1900

def gen_data(rows, cols, num_files):
    if not os.path.isdir('./data'):
        os.mkdir('./data')
    files = []
    for i in range(num_files):
        file = f'./data/{i}.pkl'
        pd.DataFrame(
            np.random.randint(1, 1_000, (rows, cols))
        ).to_pickle(file)
        files.append(file)
    return files
# Comment out the first line to run the real dataset; comment out the second line to run the testing dataset
files = gen_data(rows, cols, 10)                   # testing dataset, runs okay
files = glob.glob('../pickles3/my_data_*.pickle')  # real dataset, produces the performance warning

dfs = []
for file in files:
    df = pd.read_pickle(file)
    df['id'] = file
    dfs.append(df)
dfs = pd.concat(dfs, ignore_index=True)
Upvotes: 47
Views: 137258
Reputation: 1
Consider changing the data type (dtype) in your DataFrame. This solved my issue, because the error message was misleading.
I once had massive performance issues with a DataFrame, resulting in the IDE terminating, because I unknowingly used int64 instead of string/object in one column that was used for multiple lookups. The function ran for 30 minutes and then broke everything, even though I knew this had worked before in a very similar version. The only difference I spotted was the dtype of one of the columns of the DataFrame.
After I changed that column's dtype from integer to string (object), the program finished within 1 minute:
df['column_name'] = df['column_name'].astype(str)
While the warning message said the data was fragmented, the actual problem was a completely different one.
Upvotes: 0
Reputation: 1291
Timing the proposed solutions confirms what other responses have indicated: disregarding the warning is likely the quickest option in terms of computational efficiency.
df_0.join(ps_1)
2.37 ms ± 116 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
pd.merge(df_0, ps_1, left_index=True, right_index=True)
2.28 ms ± 82.2 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 pd.concat([df_0, ps_1], axis=1)
2.23 ms ± 107 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0[ps_1.name] = 1
18 μs ± 1.2 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.loc[:, ps_1.name] = 1
21.1 μs ± 881 ns per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: 1})
124 μs ± 12.7 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: ps_1})
216 μs ± 73.6 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit -n 100 -r 10 df_0.copy()
207 μs ± 34.1 μs per loop (mean ± std. dev. of 10 runs, 100 loops each)
import numpy as np
import pandas as pd
x = np.arange(100000)
df_0 = pd.DataFrame({f'col_{i}': x for i in range(50)})
ps_1 = pd.Series(1, index=df_0.index, name='col_new')
print('df_0.join(ps_1)')
%timeit -n 100 -r 10 df_0.join(ps_1)
print('\npd.merge(df_0, ps_1, left_index=True, right_index=True)')
%timeit -n 100 -r 10 pd.merge(df_0, ps_1, left_index=True, right_index=True)
print('\n%timeit -n 100 -r 10 pd.concat([df_0, ps_1], axis=1)')
%timeit -n 100 -r 10 pd.concat([df_0, ps_1], axis=1)
print('\n%timeit -n 100 -r 10 df_0[ps_1.name] = 1')
%timeit -n 100 -r 10 df_0[ps_1.name] = 1
df_0 = pd.DataFrame({f'col_{i}': x for i in range(5)})
print('\n%timeit -n 100 -r 10 df_0.loc[:, ps_1.name] = 1')
%timeit -n 100 -r 10 df_0.loc[:, ps_1.name] = 1
df_0 = pd.DataFrame({f'col_{i}': x for i in range(5)})
print('\n%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: 1})')
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: 1})
print('\n%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: ps_1})')
%timeit -n 100 -r 10 df_0.assign(**{ps_1.name: ps_1})
print('\n%timeit -n 100 -r 10 df_0.copy()')
%timeit -n 100 -r 10 df_0.copy()
Upvotes: 1
Reputation: 1483
Aware that this might be a reply that some will find highly controversial, I'm still posting my opinion here...
Proposed answer: Ignore the warning. If the user thinks/observes that the code suffers from poor performance, it's the user's responsibility to fix it, not the module's responsibility to propose code refactoring steps.
This can be done as follows (kudos to @daydaybroskii for the comment below)
import pandas as pd
from warnings import simplefilter
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
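If silencing the warning globally is too broad, a scoped variant is possible with the standard library's warnings module alone. This is a minimal sketch of that idea (the loop is just a stand-in to trigger the warning):

import warnings
import pandas as pd

# Suppress PerformanceWarning only inside this block; the previous
# warning filters are restored when the context manager exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
    df = pd.DataFrame(index=range(10))
    for i in range(150):        # enough single-column inserts to normally trigger the warning
        df[f"col_{i}"] = i      # no warning emitted here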
Rationale for this harsh reply:
I am seeing this warning at many different places now that I have migrated to pandas v2.0.0. The reason is that, at multiple places in the script, I remove and add records from dataframes, using many calls to .loc[] and .concat().
Now, given that I am pretty savvy in vectorization, we perform these operations with performance in mind (e.g., never inside a for loop, but maybe ripping out an entire block of records, such as overwriting some "inner 20%" of the dataframe after multiple pd.merge() operations - think of it as ETL operations on a database implemented in pandas instead of SQL). We see that the application runs incredibly fast, even though some dataframes contain ~4.5 mn records. More specifically: for one script, I get >50 of these warnings logged in <0.3 seconds, which I, subjectively, don't perceive as particularly "poor performance" (running a serial application with PyCharm in 'debugging' mode - so not exactly a setup in which you would expect the best performance in the first place).
So, I conclude:
- The same code ran on pandas <2.0.0 and never raised a warning.
- We don't call insert() across the entire ecosystem - the fragmentation that we do have in our dataframes comes from many iterative, but fast, updates - so thanks for sending us down the wrong path.
- We will certainly not refactor code that is showing excellent performance, and has been tested and validated over and over again, just because someone from the pandas team wants to educate us about stuff we know :/ If at least the performance were poor, I would welcome this message as a suggestion for improvement (even then: not a warning, but an 'info') - but given its current indiscriminate way of popping up: for once, it's actually the module that's the problem, not the user.

Edit: This is 100% the same issue as the warning PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance. - which, despite warning me about "performance", pops up 28 times (!) in less than 3 seconds - again, in debugging mode of PyCharm. I'm pretty sure removing the warning alone would improve performance by 20% (or 20 ms per operation ;)). That warning, too, started bothering us as of pandas v2.0.0 and should be removed from the module altogether.
Upvotes: 40
Reputation: 9
In case some of the dataframes to be concatenated share some common columns, something like this might do the trick:
def update_df(df: pd.DataFrame, new_df: pd.DataFrame) -> pd.DataFrame:
    cols_to_update = new_df.columns.intersection(df.columns)
    cols_to_add = new_df.columns.difference(df.columns)
    df[cols_to_update] = new_df[cols_to_update]
    return pd.concat([df, new_df[cols_to_add]], axis=1)
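A minimal usage sketch of the function above (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
new = pd.DataFrame({'b': [30, 40], 'c': [5, 6]})   # shares column 'b' with df

merged = update_df(df, new)
print(merged)
#    a   b  c
# 0  1  30  5
# 1  2  40  6

Shared column 'b' is overwritten in place, and only the genuinely new column 'c' goes through pd.concat.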
Upvotes: 0
Reputation: 71
I've checked the pandas source code (source) and the PerformanceWarning is quite simple: once more than 100 columns are created one by one, without specifying a dtype or in a different fashion than pd.concat, the warning is always shown. A simple example (creating 1001 columns, each with 10k observations):
import pandas as pd

def V1w():
    X = pd.DataFrame()
    for i in range(1001):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X.loc[:, f'X{i}'] = y
    return X

def V1nw():
    X = pd.DataFrame()
    for i in range(1001):
        y = pd.DataFrame({f'X{i}': [i] * 10000}, dtype='Int64')
        X.loc[:, f'X{i}'] = y
    return X

def V2():
    X = []
    for i in range(1001):
        y = pd.DataFrame({f'X{i}': [i] * 10000})
        X.append(y)
    X = pd.concat(X, axis=1)
    return X
Even though V1w performs, in practical terms, the same as V2:
V1w: 1.63 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V1nw: 798 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
V2: 1.37 s ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
yet V1w will produce a PerformanceWarning for every column added beyond the 100th, while V2 will not.
Additionally, specifying the dtype (as in V1nw) suppresses the warning and, more importantly, roughly doubles the performance of the operation.
In summary, there's no compelling reason to display this warning, as it seems to be a legacy artifact from earlier versions of pandas. Its simplistic logic doesn't offer meaningful insights into your data or operations.
Solution? Well:
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
Upvotes: 7
Reputation: 23121
Assigning more than 100 non-extension dtype new columns causes this warning (source code).1 For example, the following reproduces it:
df = pd.DataFrame(index=range(5))
df[[f"col{x}" for x in range(101)]] = range(101) # <---- PerformanceWarning
Using an extension dtype silences the warning.
df = pd.DataFrame(index=range(5))
df[[f"col{x}" for x in range(101)]] = pd.DataFrame([range(101)], index=df.index, dtype='Int64') # <---- no warning
However, in most cases, pd.concat() as suggested by the warning is a better solution. For the case above, that would be as follows.
df = pd.DataFrame(index=range(5))
df = pd.concat([
    df,
    pd.DataFrame([range(101)], columns=[f"col{x}" for x in range(101)], index=df.index)
], axis=1)
For the example in the OP, the following would silence the warning (because assign creates a copy).
dfs = pd.concat([pd.read_pickle(file).assign(id=file) for file in files], ignore_index=True)
1: New column assignment is done via the __setitem__() method, which calls the insert() method of the BlockManager object (the internal data structure that holds pandas dataframes). That's why the warning says insert is being called repeatedly.
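If you need to find out which assignment is triggering the repeated insert calls, one option (a sketch based on the threshold described above, not something the warning itself suggests) is to escalate the warning to an exception so you get a full traceback:

import warnings
import pandas as pd

# Turn PerformanceWarning into an exception so the traceback points
# at the exact column assignment that crossed the block threshold.
warnings.filterwarnings("error", category=pd.errors.PerformanceWarning)

df = pd.DataFrame(index=range(5))
try:
    for i in range(150):
        df[f"col{i}"] = i                      # eventually raises instead of warning
except pd.errors.PerformanceWarning as exc:
    print(f"fragmentation reported at col{i}: {exc}")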
Upvotes: 4
Reputation: 315
I had the same problem. This raised the PerformanceWarning:
df['col1'] = False
df['col2'] = 0
df['col3'] = 'foo'
This didn't:
df[['col1', 'col2', 'col3']] = (False, 0, 'foo')
This doesn't raise the warning either, but doesn't do anything about the underlying issue.
df.loc[:, 'col1'] = False
df.loc[:, 'col2'] = 0
df.loc[:, 'col3'] = 'foo'
Maybe you're adding single columns elsewhere?
copy() is supposed to consolidate the dataframe and thus defragment it. There was a bug fix for this in pandas 1.3.1 (GH 42579: https://github.com/pandas-dev/pandas/pull/42579). Copies of a larger dataframe might get expensive.
Tested on pandas 1.5.2, Python 3.8.15.
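For completeness, here is a small sketch of the consolidation that copy() performs. It peeks at df._mgr.nblocks, an internal attribute that may change between pandas versions, so treat it as illustrative only:

import warnings
import pandas as pd

warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)

df = pd.DataFrame(index=range(1000))
for i in range(150):
    df[f"col{i}"] = i            # each single-column assignment adds another internal block

print(df._mgr.nblocks)           # many blocks: the frame is fragmented
df = df.copy()                   # copy() consolidates the blocks
print(df._mgr.nblocks)           # should drop to 1 for same-dtype columns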
Upvotes: 8
Reputation: 1875
This is a problem with a recent update. Check this issue from pandas-dev. It seems to be resolved in pandas version 1.3.1 (reference PR).
Upvotes: 2
Reputation: 1208
append is not an efficient method for this operation. concat is more appropriate in this situation.
Replace
df1 = df1.append(df, ignore_index=True)
with
df1 = pd.concat((df1, df), axis=0, ignore_index=True)
Details about the differences are in this question: Pandas DataFrame concat vs append
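Note that calling pd.concat inside the loop still copies the growing frame on every iteration. A common pattern (a sketch along the lines of the question's code, assuming the files are pickles as in its test script) is to collect the pieces in a list and concatenate once at the end:

import pandas as pd

frames = []
for file in files:                           # `files` as defined in the question
    df = pd.read_pickle(file)
    df['id'] = file
    frames.append(df)

df1 = pd.concat(frames, ignore_index=True)   # single concat instead of repeated append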
Upvotes: 15