Reputation: 441
I am trying to replace one column by another if the values in that column are equal to a string. The value of this string is "wo". If this shows up in column y, replace by column x. Currently I use the following code:
df.y.replace("wo",df.x)
This runs for a very long time (millions of observations, equals days of calculations).
Is there a more efficient way to replace ?
Just in case, the data looks as follows:
y x other variables
1 mo something
2 2 something
3 3 something
wo >5 something
4 4 something
wo 7 something
It has to look like:
y x other variables
1 mo something
2 2 something
3 3 something
>5 >5 something
4 4 something
7 7 something
Upvotes: 2
Views: 148
Reputation: 210862
try this:
df.loc[(df.y == 'wo'), 'y'] = df.x
it will first filter only those rows where df.y == 'wo'
and will assign x
column's value to 'y' column
Timeit report:
In [304]: %timeit df.y.replace("wo",df.x)
100 loops, best of 3: 13.9 ms per loop
In [305]: %timeit df.loc[(df.y == 'wo'), 'y'] = df.x
100 loops, best of 3: 3.31 ms per loop
In [306]: %timeit df.ix[(df.y == 'wo'), 'y'] = df.x
100 loops, best of 3: 3.31 ms per loop
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
Upvotes: 4
Reputation: 2120
First, Pandas should be notified that that string value "wo" represents IEEE double NaN (aka numpy nan etc). See for example the na_values
parameter of the read_csv
method here. This will allow the entire column to be stored as double
which will increase efficiency. Then use something like this to replace the NaN values with the values from the other column.
Upvotes: 2