Overwrite data in dataframes, based on largest value

Question

I have a set of data stored in a Pandas dataframe that contains tracking information for where parts are used. The data looks something like this:

In the above picture, I need to be able to overwrite the values highlighted in orange with the value highlighted in blue. This overwrite would need to be based on the item number (higher overwrites lower), as well as the project name. The controls are "Item", and "Project Name".

In the example above rows #2, #3, #4 would be overwritten with 98754-48. This would be based on the maximum value in "item" having that "Client Work ID", and the "Project Name" being the same.

This is reasonably straight forward to do in VBA, but due to the volumes involved (100k entries per day, up to 50k parts, 100 machines, and 100+ clients), this would become untenable very quickly

Is there any quick way that this could be achieved using dataframes, without lists that constantly need to be maintained?

Edit Clarified request, and added example.

phil · Accepted Answer

This should work:

(
    df
    .assign(**
        {'Client Work ID': lambda df: df.groupby('Project Name')['Item'].transform('max')}
    )
)

Basically, we group by Project Name, aggregate by the maximum Item number and expand it to all Client Work ID observations using transform.

Note: The assign statement looks a little complicated as your column names include spaces, which is why we cannot use 'Client Work ID' as a keyword argument. If you're able to change your column name formatting the pipeline would look simpler:

(
    df
    .assign(
        ClientWorkID = lambda df: df.groupby('Project Name')['Item'].transform('max')
    )
)

Overwrite data in dataframes, based on largest value

Answers (1)

Related Questions