Drop duplicate Pandas and SQLAlchemy

Question

I am trying to make a script for a call centre that wishes to be able to upload millions of records from a csv file to a database, filtering out all duplicate phone numbers from the upload. To do this I am using Pandas and SQLAlchemy

df = read_csv('test.csv')

rd = models.session.query(Test).all()

I know there is the drop_duplicates() in pandas but can only find examples of removing duplicates from the same dataframe. Is this even applicable in my case

UPDATE:

This is what I have so far thanks to the help of others.

df = read_csv('phones.csv')

result_dict = [u.__dict__ for u in models.session.query(Dedupe).all()]

df['tel'] = df.index

rd = DataFrame.from_dict(result_dict)

print df[~df['tel'].isin(rd['tel'].unique())]

It is still printing out all of the csv. Even if there are duplicates

kennes · Accepted Answer

I think this illustrates how to check against the phone numbers in the database.

In [59]:  data = pd.DataFrame(range(10), columns=['number'])

In [60]: data
Out[60]: 
   number
0       0
1       1
2       2
3       3
4       4
5       5
6       6
7       7
8       8
9       9

In [61]: check = range(5)

In [62]: check
Out[62]: [0, 1, 2, 3, 4]

In [63]: data = data[~data['number'].isin(check)]

In [64]: data
Out[64]: 
   number
5       5
6       6
7       7
8       8
9       9

Replace data with your upload table and check will need to be a queried list of phone numbers from your database.

Drop duplicate Pandas and SQLAlchemy

Answers (1)

Related Questions