Reputation: 1
I am working on a project where I have a xlsx of records. Some records have multiple instances. Like this:
Id, Field 1, Field 2, Field 3
ABC01, 0.034, 0.044, 0.050
ABC01, 0.055, 0.065, 0.060
ABC01, 0.047, 0.082, 0.070
DEF02, 0.034, 0.044, 0.050
DEF02, 0.055, 0.065, 0.060
XYZ05, 0.047, 0.082, 0.070
My goal is to write the entire row of each instance with the highest value in Field 3 into a new xlsx. I would also like non-duplicated records to be copied over as well.
So far I have used openpyxl to iterate column one to make a list of values. I have created a list of values for Field 3. And I have created a list of duplicate ID values. Finally, I know how to write rows to a new woorkbook if I have a list of row numbers.
What I need is how I can get a list of the row numbers of the IDs with the highest values in Fields 3.
Thanks!
Upvotes: 0
Views: 198
Reputation: 6302
You could consider using pandas for this. It supports Excel IO.
import pandas as pd
df = pd.read_excel("data.xlsx")
indices = df.groupby('Id')["Field 3"].idxmax()
deduped_df = df.ix[indices]
deduped_df.to_excel("deduped_data.xlsx")
Upvotes: 1