Reputation: 244
Each day I get a pandas dataframe that has five columns called column1, column2, column3, column4, and column5. I want to add rows that I previously did not receive to a file where I keep the unique rows, called known_data.csv.
In order to do so, I wrote some code that should:

1. Load the data from known_data.csv as a dataframe called existing_data
2. Add a new column called 'existing' to the existing_data df
3. Merge the old existing_data dataframe with the dataframe called new_data on the five columns
4. Check whether new_data contains new rows by looking at merge[merge.existing.isnull()] (the complement of the new data and the existing data)
5. Append the new rows to the known_data.csv file
My code looks like this:

import pandas as pd

existing_data = pd.read_csv("known_data.csv")
existing_data['existing'] = 'yes'
merge_data = pd.merge(new_data, existing_data,
                      on=['column1', 'column2', 'column3', 'column4', 'column5'],
                      how='left')
complement = merge_data[merge_data.existing.isnull()]
del complement['existing']
complement.to_csv("known_data.csv", mode='a', index=False, header=False)
Unfortunately, this code does not work as expected: the complement is never empty. Even when I receive data that has already been recorded in known_data.csv, some of the rows of new_data are appended to the file anyway.
Question: What am I doing wrong? How can I solve this problem? Does it have to do with the way I'm reading from and writing to the file?
Edit: Adding a new column called existing to the existing_data dataframe is probably not the best way of checking the complement between existing_data and new_data. If anyone has a better suggestion, that would be hugely appreciated!
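One common alternative (a sketch, not taken from the question itself) is to let pandas mark row provenance with merge's indicator flag, so no helper column is needed. The column names follow the question; the toy values are made up for illustration:

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# Toy data standing in for the real feed (hypothetical values).
existing_data = pd.DataFrame([[1, 2, 3, 4, 5],
                              [6, 7, 8, 9, 10]], columns=cols)
new_data = pd.DataFrame([[1, 2, 3, 4, 5],         # already known
                         [11, 12, 13, 14, 15]],   # genuinely new
                        columns=cols)

# indicator=True adds a '_merge' column; 'left_only' marks rows
# present in new_data but absent from existing_data.
merged = new_data.merge(existing_data, on=cols, how='left', indicator=True)
complement = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(complement)
```

With this approach the complement is exactly the rows to append, and there is no 'existing' column to delete afterwards.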
Edit2: The problem was that although the dataframes looked identical, there were some values that were of a different type. Somehow this error only showed when I tried to merge a subset of the new dataframe for which this was the case.
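For anyone hitting the same symptom, here is a minimal reproduction of the dtype pitfall described in Edit2 (values are hypothetical): comparing the frames' dtypes reveals the mismatch, and casting both sides to a common type before merging fixes it.

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

# One frame holds ints, the other the same values as strings, so an
# equi-join on these columns silently finds no matches.
existing_data = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
new_data = pd.DataFrame([['1', '2', '3', '4', '5']], columns=cols)

print(existing_data.dtypes.equals(new_data.dtypes))  # False: int64 vs object

# Casting both sides to a common dtype makes the rows comparable again.
existing_data = existing_data.astype(str)
new_data = new_data.astype(str)

merged = new_data.merge(existing_data, on=cols, how='inner')
print(len(merged))  # 1: the row now matches
```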
Upvotes: 5
Views: 1469
Reputation: 471
I think what you are looking for is a concat operation followed by a drop_duplicates call.
# Concat the two dataframes into a new dataframe holding all the data (memory intensive):
complement = pd.concat([existing_data, new_data], ignore_index=True)
# Remove all duplicates:
complement.drop_duplicates(inplace=True, keep=False)
This will first create a dataframe holding all the old and new data, and in a second step will delete all duplicate entries. You can also restrict the duplicate comparison to certain columns via the subset argument!
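A small worked example of the above, with made-up values. Note that keep=False drops every copy of a duplicated row, so what remains is the set of rows unique to either frame:

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4', 'column5']

existing_data = pd.DataFrame([[1, 2, 3, 4, 5]], columns=cols)
new_data = pd.DataFrame([[1, 2, 3, 4, 5],      # duplicate of an existing row
                         [6, 7, 8, 9, 10]],    # genuinely new
                        columns=cols)

complement = pd.concat([existing_data, new_data], ignore_index=True)
# keep=False removes *all* copies of duplicated rows; subset= restricts
# the comparison to the listed columns.
complement.drop_duplicates(subset=cols, inplace=True, keep=False)
print(complement)  # only the [6, 7, 8, 9, 10] row survives
```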
See the documentation here:

concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
drop_duplicates: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
Upvotes: 2