Anthony Cronin
Anthony Cronin

Reputation: 1

Finding a list of row numbers based on values in separate columns?

I am working on a project where I have a xlsx of records. Some records have multiple instances. Like this:

Id, Field 1, Field 2, Field 3
ABC01, 0.034, 0.044, 0.050
ABC01, 0.055, 0.065, 0.060
ABC01, 0.047, 0.082, 0.070
DEF02, 0.034, 0.044, 0.050
DEF02, 0.055, 0.065, 0.060
XYZ05, 0.047, 0.082, 0.070

My goal is to write the entire row of each instance with the highest value in Field 3 into a new xlsx. I would also like non-duplicated records to be copied over as well.

So far I have used openpyxl to iterate column one to make a list of values. I have created a list of values for Field 3. And I have created a list of duplicate ID values. Finally, I know how to write rows to a new woorkbook if I have a list of row numbers.

What I need is how I can get a list of the row numbers of the IDs with the highest values in Fields 3.

Thanks!

Upvotes: 0

Views: 198

Answers (1)

Martin Valgur
Martin Valgur

Reputation: 6302

You could consider using pandas for this. It supports Excel IO.

import pandas as pd

df = pd.read_excel("data.xlsx")
indices = df.groupby('Id')["Field 3"].idxmax()
deduped_df = df.ix[indices]
deduped_df.to_excel("deduped_data.xlsx")

Upvotes: 1

Related Questions