Using Python 3, how can I speed up the update of a dataframe using current methods?

Question

I have a dataframe with ~5MM records. I have to update a column with a new value based on either an exact match to another column or a partial match. I've written the code to do this, but I could never figure out how to chain filters. Right now, my code updates the column for every single filter requirement. Right now my code looks like this.

In SQL it looks like:

UPDATE table
SET column1 = 'Value' 
WHERE column2 IN ('123','456','789')

OR

UPDATE table
SET column1 = 'a different value'
WHERE 1 = 1
AND column2 LIKE '987%'
OR column2 LIKE '654%'

Currently, in Python, I have:

df.loc[df['column2'] == '123', 'column1 '] = 'Value'
df.loc[df['column2'] == '456', 'column1 '] = 'Value'
df.loc[df['column2'] == '789', 'column1 '] = 'Value'

or

df.loc[df.column2.str.startswith('987'),'COMMODITY']='a different value'
df.loc[df.column2.str.startswith('654'),'COMMODITY']='a different value'

I'm guessing that Python has to loop through all 5MM records for each line which is super inefficient. What is the better way to do this?

It's been suggested that this is a duplicate of How to implement 'in' and 'not in' for Pandas dataframe. This question is five years old, and isin() doesn't work like that anymore.

BENY · Accepted Answer

How about using np.where with isin

df[ 'column1 '] = np.where(df.column2.isin(['123','456','789']),'Value',df.column1)

About the startwiths

df.COMMODITY=np.where(df.column2.str.startswith(('987','654')),'a different value',df.COMMODITY)

Using Python 3, how can I speed up the update of a dataframe using current methods?

Answers (1)

Related Questions