oymonk
oymonk

Reputation: 353

Too many columns resulting in `PerformanceWarning: DataFrame is highly fragmented`

I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.

import pandas as pd
import numpy as np

data = {'filepath':  ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
        }

df = pd.DataFrame(data)

df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))

df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))

df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]

    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1) 
print(df)

This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

To fix this I have tried:

dfAnimal = df.copy

dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)

dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)

df = pd.concat(dfAnimal,dfFish)

The above gives me errors such as method object is not iterable and method object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance error?

Upvotes: 0

Views: 1377

Answers (1)

user17242583
user17242583

Reputation:

Try creating all your new columns in a dict, and then convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:

new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
    'Fish': df['filepath'].str.contains("barracuda",case=False),
}

new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)

Added to your original code, it would be something like this:

import pandas as pd
import numpy as np

data = {'filepath':  ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
        }

df = pd.DataFrame(data)


##### These are the new lines #####

new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
    'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)

##### End of new lines #####

df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))

df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)

def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]

    return squeezed.reindex(original_columns, fill_value=np.nan)

df = df.apply(squeeze_nan, axis=1) 
print(df)

Upvotes: 1

Related Questions