What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Question

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.

I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.

I have a dummy example of the dataframe and series in question:

In [6]: df
Out[6]:
   idxA  idxB  var2
0     0     1   2.0
1     0     2   3.0
2     2     4   2.0
3     2     1   1.0

In [8]: labels
Out[8]:
0    A
1    B
2    C
3    D
4    E
Name: label, dtype: object

Currently, I'm converting the series to a dictionary and using replace:

label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)

which does give me the expected result:

In [12]: df
Out[12]:
  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0

However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?

Thanks!

EDIT: I accepted @coldspeed's answer. Couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:

In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

cs95 · Accepted Answer

You can call map for each column using apply:

df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df

  idxA idxB  var2
0    A    B   2.0
1    A    C   3.0
2    C    E   2.0
3    C    B   1.0

This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do

cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)

map is faster than replace, depending on the number of columns to replace. Your mileage may vary.

df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)

%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)

6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Answers (1)

Related Questions