Kspr
Kspr

Reputation: 685

In pandas, how to check if two strings match multiple columns in any of the rows in existing data frame and delete it

I have a dataframe that looks like this

                 rootID   parentID    jobID  time                         counter
              0    A         B          D    2019-01-30 14:33:21.339469      0
              1    E         F          G    2019-01-30 14:33:21.812381      0
              2    A         C          D    2019-01-30 15:33:21.812381      0
              3    E         B          F    2019-01-30 15:33:21.812381      0
              4    E         F          G    2019-01-30 16:33:21.812381      0

I will be fed live data, and the data is a rootID string, parentID string, jobID string and a date.

I want to check if the newly retrieved rootID and parentID combination already exists in the dataframe. So if I retrieve rootID = "A" and parentID == "B", jobID =="T" I want to access the first row of the dataframe. I then want to delete that row and append the new information and increment the update counter.

         IF rootID and parentID exist in the same row in dataframe. 
         delete row and append new data with updated jobID and incremented counter. 

The dataframe should now look like

                 rootID   parentID    jobID  time                         counter
              0    E         F          G    2019-01-30 14:33:21.812381      0
              1    A         C          D    2019-01-30 15:33:21.812381      0
              2    E         B          F    2019-01-30 15:33:21.812381      0
              3    E         F          G    2019-01-30 16:33:21.812381      0
              4    A         B          T    2019-01-30 17:33:21.339469      1

Anyone have any idea how this could be done?

Upvotes: 0

Views: 459

Answers (2)

rafaelc
rafaelc

Reputation: 59284

I would

root_id, parent_id, job_id = get_ids() # e.g. ('A', 'B', 'T')

cond = df.rootID.eq(root_id) & df.parentID.eq(parent_id) & df.jobID.eq(job_id)

df.loc[cond, ['time', 'counter']] = datetime.datetime.now(), df.loc[cond, 'counter']+1

This would update your data frame in_place. I wouldn't reorder the Dataframe all the time, unless it is absolutely needed. If you can, for example, re-sort it once a day, you can just

df.sort_values(by='time') #.reset_index(drop=True), if needed

periodically. However, if you absolutely have to change the rows at each time new data comes, then, assuming you have unique ids,

df = pd.concat([df[df[cond].index.item()+1:], df[cond]]) 

Upvotes: 1

ldoe
ldoe

Reputation: 340

I think you can get a pretty close solution just by translating your example :

for index, row in df.iterrows():
    if row['rootID'] == rootID and row['parentID'] == parentID:
        df.drop(df.iloc[index])
        row['time'] = datetime.datetime.now()
        row['counter'] += 1
        df = df.concat(row)

Without loops:

selected_row = df[(df['rootId'] == rootID) & (df['parentID'] == parentID)])
df.drop(selected_row)
selected_row['time'] = datetime.datetime.now()
selected_row['counter'] += 1
df = df.concat(selected_row)

This assume you have only one row matching the rootID and parentID combination you are searching for.

Upvotes: 0

Related Questions