Reputation: 1325
I am trying to dedupe some data in a large Excel spreadsheet with 10,000 rows.
This is the script that I have:
import pandas as pd
import pandas_dedupe
df = pd.read_excel('Qualys-Working.xlsx')
df_final = pandas_dedupe.dedupe_dataframe(df,['IP','DNS','CONTROL_ID','INSTANCE'])
df_final.to_excel('Qualys-TimD-Working-NEW.xlsx',index=False)
But when I run the script it keeps on asking me to make choices about the data:
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
n
IP : 10.0.0.12
DNS : prd-sql-a5
CONTROL_ID : 999999.0
INSTANCE : None
IP : None
DNS : None
CONTROL_ID : None
INSTANCE : None
0/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
It would take forever to run this against 10,000 rows and I am not sure of all the choices I'd have to make. How can I get this to run automatically?
Upvotes: 1
Views: 346
Reputation: 401
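The pandas-dedupe library works as follows:
The first time you run it, it asks you to label a sample of record pairs as duplicates or not (these are the prompts you are seeing) and learns matching rules from your answers. You do not have to label all 10,000 rows: once you have given it enough positive and negative examples, answer (f)inished, and it saves what it has learned to settings files.
The next time you run pandas-dedupe, it will load the settings files automatically and dedupe your data based on what it has learned.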
In summary, first you need to teach pandas-dedupe how to work; then it will do the job automatically for you.
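To tell in advance whether a run will stop and ask questions, you can check whether the learned settings are already on disk. The following is only a minimal sketch: the two file names are my assumption about what pandas-dedupe writes into the working directory by default, and `training_state` and `will_prompt` are hypothetical helpers, not part of the library. Check your own script folder after a training run and adjust the names if they differ.

```python
from pathlib import Path

# ASSUMPTION: default artefact names written by pandas-dedupe after training.
# Verify these against the files that actually appear next to your script.
SETTINGS_FILE = "dedupe_dataframe_learned_settings"
TRAINING_FILE = "dedupe_dataframe_training.json"


def training_state(workdir="."):
    """Report which of the assumed pandas-dedupe artefacts exist in workdir."""
    base = Path(workdir)
    return {
        "settings": (base / SETTINGS_FILE).is_file(),
        "training": (base / TRAINING_FILE).is_file(),
    }


def will_prompt(workdir="."):
    """Return True when no learned settings are found, i.e. labelling prompts are expected."""
    return not training_state(workdir)["settings"]


if __name__ == "__main__":
    if will_prompt():
        print("No learned settings found: expect the y/n/u/f labelling prompts.")
    else:
        print("Learned settings found: the run should not need manual labelling.")
```

If this reports that no settings were found, run your script once, label pairs until you have a reasonable number of both yes and no answers, and then enter f. Keep the generated files in the same folder as the script so that later runs can pick them up.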
Upvotes: 1