Pandas Dedupe in Python - how do I get the script to run automatically?

Question

I am trying to dedupe some data in a large Excel spreadsheet with 10,000 rows.

This is the script that I have:

import pandas as pd
import pandas_dedupe
df = pd.read_excel('Qualys-Working.xlsx')
df_final = pandas_dedupe.dedupe_dataframe(df,['IP','DNS','CONTROL_ID','INSTANCE'])
df_final.to_excel('Qualys-TimD-Working-NEW.xlsx',index=False)

But when I run the script it keeps on asking me to make choices about the data:

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
n
IP : 10.0.0.12
DNS : prd-sql-a5
CONTROL_ID : 999999.0
INSTANCE : None

IP : None
DNS : None
CONTROL_ID : None
INSTANCE : None

0/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious

It would take forever to run this against 10,000 rows and I am not sure of all the choices I'd have to make. How can I get this to run automatically?

iEriii · Accepted Answer

The pandas-dedupe library works as follows:

You label a sample of your dataset as duplicate or distinct records.
When you have labelled enough records, you press f (i.e. finished).
pandas-dedupe saves what it has learned in settings files (you will see them appear in your folder).

The next time you run pandas-dedupe, it will load the settings files automatically and dedupe your data based on what it has learned.

In summary, first you need to teach pandas-dedupe how to work; then it will do the job automatically for you.
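
As a rough sketch of that workflow, the script from the question can be wrapped so that it tells you up front whether a run will prompt for labels. This assumes the settings file is named "<config_name>_learned_settings" with a default config_name of "dedupe_dataframe", and that it is written to the current working directory; check the files that actually appear in your folder and adjust the name if it differs. The helper names settings_file_exists and dedupe_workbook are made up for this illustration.

```python
from pathlib import Path


# Assumption: pandas-dedupe names its settings file "<config_name>_learned_settings"
# and the default config_name is "dedupe_dataframe". Verify against the files
# that appear next to your script after the first labelling session.
def settings_file_exists(folder=".", config_name="dedupe_dataframe"):
    """Return True if a learned-settings file from an earlier run is present."""
    return (Path(folder) / f"{config_name}_learned_settings").is_file()


def dedupe_workbook(src, dst, fields):
    """Dedupe the workbook at src and write the result to dst."""
    # Imported lazily so settings_file_exists() works without these packages.
    import pandas as pd
    import pandas_dedupe

    if not settings_file_exists():
        # First run: label a sample, then press f to save the settings.
        print("No settings file found: expect the labelling prompts on this run.")

    df = pd.read_excel(src)
    df_final = pandas_dedupe.dedupe_dataframe(df, fields)
    df_final.to_excel(dst, index=False)
    return df_final
```

With the file names from the question, you would call dedupe_workbook('Qualys-Working.xlsx', 'Qualys-TimD-Working-NEW.xlsx', ['IP', 'DNS', 'CONTROL_ID', 'INSTANCE']) once interactively, then run it again from the same folder for the automatic pass.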
