Reputation: 1387
I am trying to write a script for a call centre that needs to upload millions of records from a CSV file to a database, filtering out any phone numbers that are already in the database. To do this I am using pandas and SQLAlchemy:
from pandas import read_csv

df = read_csv('test.csv')
rd = models.session.query(Test).all()
I know pandas has drop_duplicates(), but I can only find examples of removing duplicates from within the same DataFrame. Is this even applicable in my case?
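For reference, the examples I keep finding look something like this and only remove rows whose number repeats inside the CSV itself (tel is just what I am calling the phone number column here):

import pandas as pd

# drops rows whose phone number is duplicated within the CSV,
# but does not compare against anything already in the database
df = pd.read_csv('test.csv')
df = df.drop_duplicates(subset='tel')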
UPDATE:
This is what I have so far, thanks to the help of others:
from pandas import read_csv, DataFrame

df = read_csv('phones.csv')
result_dict = [u.__dict__ for u in models.session.query(Dedupe).all()]
df['tel'] = df.index
rd = DataFrame.from_dict(result_dict)
print df[~df['tel'].isin(rd['tel'].unique())]
It is still printing out all of the CSV, even when there are duplicates.
Upvotes: 1
Views: 664
Reputation: 2145
I think this illustrates how to check against the phone numbers in the database.
In [59]: data = pd.DataFrame(range(10), columns=['number'])
In [60]: data
Out[60]:
number
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
In [61]: check = range(5)
In [62]: check
Out[62]: [0, 1, 2, 3, 4]
In [63]: data = data[~data['number'].isin(check)]
In [64]: data
Out[64]:
number
5 5
6 6
7 7
8 8
9 9
Here, data stands in for your upload DataFrame, and check needs to be a list of the phone numbers queried from your database.
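A sketch of how that might look with SQLAlchemy, assuming the upload CSV has a tel column and a mapped class Dedupe with a tel column, as in your update (adjust the names to your own schema):

import pandas as pd

df = pd.read_csv('phones.csv')

# query only the phone-number column; .all() returns a list of
# one-element result tuples, so unpack them into a plain list
existing = [tel for (tel,) in models.session.query(Dedupe.tel).all()]

# keep only the rows whose number is not already in the database
new_rows = df[~df['tel'].isin(existing)]
print(new_rows)

One thing to double-check in your update: df['tel'] = df.index replaces the tel column with the row index, so the isin comparison never sees the actual phone numbers, which could be why everything is still printed.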
Upvotes: 2