Neil
Neil

Reputation: 7202

Pandas slow on data frame replace

I have an Excel file (.xlsx) with about 800 rows and 128 columns with pretty dense data in the grid. There are about 9500 cells that I am trying to replace the cell values of using Pandas data frame:

xlsx = pandas.ExcelFile(filename)
frame = xlsx.parse(xlsx.sheet_names[0])
media_frame = frame[media_headers] # just get the cols that need replacing

from_filenames = get_from_filenames() # returns ~9500 filenames to replace in DF
to_filenames = get_to_filenames()

media_frame = media_frame.replace(from_filenames, to_filenames)
frame.update(media_frame)
frame.to_excel(filename)

The replace() takes 60 seconds. Any way to speed this up? This is not huge data or task, I was expecting pandas to move much faster. FYI I tried doing the same processing with same file in CSV, but the time savings was minimal (about 50 seconds on the replace())

Upvotes: 5

Views: 2187

Answers (3)

Netro
Netro

Reputation: 7297

I found creating new col and dropping the existing column one is faster than waiting forever. ;)

Upvotes: 0

Neil
Neil

Reputation: 7202

I got the 60 second task to complete in 10 seconds by removing replace() altogether and using set_value() one element at a time.

Upvotes: 1

piRSquared
piRSquared

Reputation: 294218

strategy
create pd.Series representing a map from filenames to filenames.
stack our dataframe, map, then unstack

setup

import pandas as pd
import numpy as np
from string import letters

media_frame = pd.DataFrame(
    pd.DataFrame(
        np.random.choice(list(letters), 9500 * 800 * 3) \
          .reshape(3, -1)).sum().values.reshape(9500, -1))

u = np.unique(media_frame.values)
from_filenames = pd.Series(u)
to_filenames = from_filenames.str[1:] + from_filenames.str[0]

m = pd.Series(to_filenames.values, from_filenames.values)

solution

media_frame.stack().map(m).unstack()

timing

5 x 5 dataframe

enter image description here

100 x 100

enter image description here

9500 x 800

enter image description here

9500 x 800
map using series vs dict
d = dict(zip(from_filenames, to_filenames))

enter image description here

Upvotes: 7

Related Questions