Reputation: 244
Each day I get a pandas dataframe that has five columns called column1, column2, column3, column4, and column5. I want to add rows that I previously did not receive to a file where I keep the unique rows, called known_data.csv.
In order to do so, I wrote some code that should:

1. Load the data from known_data.csv as a dataframe called existing_data
2. Add a new column called 'existing' to the existing_data df
3. Merge the old existing_data dataframe with the dataframe called new_data on the five columns
4. Check whether new_data contains new rows by looking at merge[merge.existing.isnull()] (the complement of the new data and the existing data)
5. Append the new rows to the known_data.csv file
My code looks like this:

import pandas as pd

existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data,
                      on=['column1', 'column2', 'column3', 'column4', 'column5'],
                      how='left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
Unfortunately, this code does not work as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.
Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading from and writing to the file?
Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion, that would be hugely appreciated!
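One common alternative (a sketch, not taken from the question itself) is to let pandas mark row provenance with merge's indicator flag, so no helper column is needed. The column names follow the question; the toy values are made up for illustration:

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# Toy data standing in for the real feed (hypothetical values).
existing_data = pd.DataFrame([[1, 2, 3, 4, 5],
                              [6, 7, 8, 9, 10]], columns=cols)
new_data = pd.DataFrame([[1, 2, 3, 4, 5],         # already known
                         [11, 12, 13, 14, 15]],   # genuinely new
                        columns=cols)

# indicator=True adds a '_merge' column; 'left_only' marks rows
# present in new_data but absent from existing_data.
merged = new_data.merge(existing_data, on=cols, how='left', indicator=True)
complement = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(complement)
```

With this approach the complement is exactly the rows to append, and there is no 'existing' column to delete afterwards.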
Edit2: The problem was that although the dataframes looked identical, there were some values that were of a different type. Somehow this error only showed when I tried to merge a subset of the new dataframe for which this was the case.
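For anyone hitting the same symptom, here is a minimal reproduction of the dtype pitfall described in Edit2 (values are hypothetical): comparing the frames' dtypes reveals the mismatch, and casting both sides to a common type before merging fixes it.

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# One frame holds ints, the other the same values as strings, so an
# equi-join on these columns silently finds no matches.
existing_data = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
new_data = pd.DataFrame([['1', '2', '3', '4', '5']], columns=cols)

print(existing_data.dtypes.equals(new_data.dtypes))  # False: int64 vs object

# Casting both sides to a common dtype makes the rows comparable again.
existing_data = existing_data.astype(str)
new_data = new_data.astype(str)

merged = new_data.merge(existing_data, on=cols, how='inner')
print(len(merged))  # 1: the row now matches
```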
Upvotes: 5
Views: 1469
Reputation: 471
I think what you are looking for is a concat operation followed by a drop_duplicates call.
# Concat the two dataframes into a new dataframe holding all the data (memory intensive):
complement = pd.concat([existing_data, new_data], ignore_index=True)
# Remove all duplicates:
complement.drop_duplicates(inplace=True, keep=False)
This will first create a dataframe holding all the old and new data, and in a second step will delete all duplicate entries. You can also restrict the duplicate comparison to certain columns via the subset argument!
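A small worked example of the above, with made-up values. Note that keep=False drops every copy of a duplicated row, so what remains is the set of rows unique to either frame:

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

existing_data = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
new_data = pd.DataFrame([[1, 2, 3, 4, 5],      # duplicate of an existing row
                         [6, 7, 8, 9, 10]],    # genuinely new
                        columns=cols)

complement = pd.concat([existing_data, new_data], ignore_index=True)
# keep=False removes *all* copies of duplicated rows; subset= restricts
# the comparison to the listed columns.
complement.drop_duplicates(subset=cols, inplace=True, keep=False)
print(complement)  # only the [6, 7, 8, 9, 10] row survives
```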
See the documentation here:

concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
Upvotes: 2