bluethundr

Reputation: 1325

Pandas Dedupe in Python - how do I get the script to run automatically?

I am trying to dedupe some data in a large Excel spreadsheet with 10,000 rows.

This is the script that I have:

import pandas as pd
import pandas_dedupe

# Load the spreadsheet, dedupe on the four key columns, and write the result.
df = pd.read_excel('Qualys-Working.xlsx')
df_final = pandas_dedupe.dedupe_dataframe(df, ['IP', 'DNS', 'CONTROL_ID', 'INSTANCE'])
df_final.to_excel('Qualys-TimD-Working-NEW.xlsx', index=False)

But when I run the script it keeps on asking me to make choices about the data:

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
n
IP : 10.0.0.12
DNS : prd-sql-a5
CONTROL_ID : 999999.0
INSTANCE : None

IP : None
DNS : None
CONTROL_ID : None
INSTANCE : None

0/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious

It would take forever to run this against 10,000 rows and I am not sure of all the choices I'd have to make. How can I get this to run automatically?

Upvotes: 1

Views: 346

Answers (1)

iEriii

Reputation: 401

The pandas-dedupe library works as follows:

  • You label a sample of your dataset as duplicate or distinct records.
  • When you have labelled enough records, you press f (i.e. finished).
  • pandas-dedupe saves what it has learned in settings files (you will see them appear in your folder).

The next time you run pandas-dedupe, it will load the settings files automatically and dedupe your data based on what it has learned.

In summary, first you need to teach pandas-dedupe how to work; then it will do the job automatically for you.
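
To make the two phases concrete, here is a minimal sketch that checks whether a settings file already exists before calling the same function as in the question. The file names are an assumption based on the library's default config name (dedupe_dataframe), so compare them with what actually appears in your folder. The helper names is_trained and dedupe_workbook are made up for this example.

```python
from pathlib import Path

# Assumption: with the default config name, pandas-dedupe writes these two
# files into the current working directory. Check your folder to confirm.
SETTINGS_FILE = "dedupe_dataframe_learned_settings"
TRAINING_FILE = "dedupe_dataframe_training.json"


def is_trained(folder="."):
    """Return True if a learned-settings file already exists in `folder`."""
    return (Path(folder) / SETTINGS_FILE).exists()


def dedupe_workbook(src, dst, fields):
    """Dedupe the Excel file `src` on `fields` and write the result to `dst`."""
    # Imported here so the helper above works without the libraries installed.
    import pandas as pd
    import pandas_dedupe

    if not is_trained():
        # First run: expect the (y)es / (n)o / (u)nsure / (f)inished prompts.
        print("No settings file found; you will be asked to label records.")

    df = pd.read_excel(src)
    df_final = pandas_dedupe.dedupe_dataframe(df, fields)
    df_final.to_excel(dst, index=False)
```

For the spreadsheet in the question, the call would be dedupe_workbook('Qualys-Working.xlsx', 'Qualys-TimD-Working-NEW.xlsx', ['IP', 'DNS', 'CONTROL_ID', 'INSTANCE']). Run it from the same folder each time, since the settings files are looked up relative to the working directory.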

Upvotes: 1
