In pandas, how to check if two strings match multiple columns in any of the rows in existing data frame and delete it

Question

I have a dataframe that looks like this

                 rootID   parentID    jobID  time                         counter
              0    A         B          D    2019-01-30 14:33:21.339469      0
              1    E         F          G    2019-01-30 14:33:21.812381      0
              2    A         C          D    2019-01-30 15:33:21.812381      0
              3    E         B          F    2019-01-30 15:33:21.812381      0
              4    E         F          G    2019-01-30 16:33:21.812381      0

I will be fed live data, and the data is a rootID string, parentID string, jobID string and a date.

I want to check if the newly retrieved rootID and parentID combination already exists in the dataframe. So if I retrieve rootID = "A" and parentID == "B", jobID =="T" I want to access the first row of the dataframe. I then want to delete that row and append the new information and increment the update counter.

         IF rootID and parentID exist in the same row in dataframe. 
         delete row and append new data with updated jobID and incremented counter.

The dataframe should now look like

                 rootID   parentID    jobID  time                         counter
              0    E         F          G    2019-01-30 14:33:21.812381      0
              1    A         C          D    2019-01-30 15:33:21.812381      0
              2    E         B          F    2019-01-30 15:33:21.812381      0
              3    E         F          G    2019-01-30 16:33:21.812381      0
              4    A         B          T    2019-01-30 17:33:21.339469      1

Anyone have any idea how this could be done?

rafaelc · Accepted Answer

I would

root_id, parent_id, job_id = get_ids() # e.g. ('A', 'B', 'T')

cond = df.rootID.eq(root_id) & df.parentID.eq(parent_id) & df.jobID.eq(job_id)

df.loc[cond, ['time', 'counter']] = datetime.datetime.now(), df.loc[cond, 'counter']+1

This would update your data frame in_place. I wouldn't reorder the Dataframe all the time, unless it is absolutely needed. If you can, for example, re-sort it once a day, you can just

df.sort_values(by='time') #.reset_index(drop=True), if needed

periodically. However, if you absolutely have to change the rows at each time new data comes, then, assuming you have unique ids,

df = pd.concat([df[df[cond].index.item()+1:], df[cond]])

In pandas, how to check if two strings match multiple columns in any of the rows in existing data frame and delete it

Answers (2)

Related Questions