Pandas drop_duplicates() not working after adding a row to a DataFrame read from a csv file

Question

My code is below:

import os
import pandas as pd

indexing_file_path = 'indexing.csv'
if not os.path.exists(indexing_file_path):
    df = pd.DataFrame([['1111', '20200101', '20200101'], 
                       ['1112', '20200101', '20200101'], 
                       ['1113', '20200101', '20200101']], 
                       columns = ['nname', 'nstart', 'nend'])
else:
    df = pd.read_csv(indexing_file_path, header = 0)

print(df)
df.loc[len(df)] = ['1113', '20200202', '20200303']
# append() method not working either
print(df)
df.drop_duplicates('nname', keep = 'last', inplace = True)
print(df)
df.to_csv(indexing_file_path, index = False)

I want to keep the nname column unique in this file.

The first time the code runs, it saves the records to the csv file correctly, even though 1113 appears twice before drop_duplicates() is called.

The second time the code runs, it saves two 1113 rows to the csv file, because the DataFrame is now created from the csv file.

From the third run onward, the file always keeps two 1113 rows.

For now I have a workaround:

1. Save to the csv file with two 1113 rows.

2. Read the csv file again.

3. Call drop_duplicates again.

4. Save to the csv file again.
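
The four steps above can be sketched roughly as follows. This is only an illustration under assumptions: the temporary path and the file contents are made up to stand in for indexing.csv after a run that left two 1113 rows behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical temporary path standing in for indexing.csv.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "indexing.csv")

# Step 1: the file has been saved with two 1113 rows.
with open(path, "w") as f:
    f.write(
        "nname,nstart,nend\n"
        "1111,20200101,20200101\n"
        "1113,20200101,20200101\n"
        "1113,20200202,20200303\n"
    )

# Step 2: read the file again; every nname value is now parsed the same way.
df = pd.read_csv(path, header=0)

# Step 3: drop duplicates again, keeping the most recent row per nname.
df = df.drop_duplicates("nname", keep="last")

# Step 4: save again.
df.to_csv(path, index=False)

result = pd.read_csv(path, header=0)
print(result)
```

The second pass works because, after the round trip through the file, both 1113 values are read back with the same type.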

Why does a DataFrame created from a csv file behave so differently?

How can I save only the unique rows to the csv file in a single pass?

fish · Accepted Answer

I can answer my question now.

The reason is:

When the DataFrame is created from a csv file, pandas infers the nname column as integer.

But when I add the 1113 row again, pandas treats the new row's nname as a string. The integer 1113 does not equal the string '1113', so pandas keeps both rows.
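
A minimal sketch of this mismatch, assuming an in-memory CSV (csv_text is a made-up stand-in for indexing.csv so the example runs without touching disk):

```python
import io

import pandas as pd

# Made-up stand-in for the contents of indexing.csv.
csv_text = (
    "nname,nstart,nend\n"
    "1111,20200101,20200101\n"
    "1113,20200101,20200101\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=0)
nname_dtype = str(df["nname"].dtype)  # an integer dtype inferred from the digits

# Append a row whose values are Python strings, as in the question.
df.loc[len(df)] = ["1113", "20200202", "20200303"]

# Count how many nname values are strings after the append.
n_str = sum(isinstance(v, str) for v in df["nname"])

# The integer 1113 and the string '1113' are different values, so both survive.
deduped = df.drop_duplicates("nname", keep="last")

print(nname_dtype)
print(n_str)
print(len(deduped))
```

Only the appended value is a string, which is why all three rows remain after drop_duplicates.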

The solution is:

Read every column of the csv file as strings.

df = pd.read_csv(indexing_file_path, header=0, dtype=str)
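
To check the fix without a real file, here is a small sketch that reads an in-memory CSV with dtype=str (csv_text is again a made-up stand-in for indexing.csv):

```python
import io

import pandas as pd

# Made-up stand-in for the contents of indexing.csv.
csv_text = (
    "nname,nstart,nend\n"
    "1111,20200101,20200101\n"
    "1112,20200101,20200101\n"
    "1113,20200101,20200101\n"
)

# dtype=str keeps every column as strings, matching the appended row.
df_str = pd.read_csv(io.StringIO(csv_text), header=0, dtype=str)
df_str.loc[len(df_str)] = ["1113", "20200202", "20200303"]

# Both 1113 values are now the string '1113', so only the last one is kept.
df_str = df_str.drop_duplicates("nname", keep="last")

remaining = df_str["nname"].tolist()
latest_start = df_str.loc[df_str["nname"] == "1113", "nstart"].item()
print(remaining)     # ['1111', '1112', '1113']
print(latest_start)  # 20200202
```

An alternative would be to append the new nname as an integer instead, but reading everything as strings also keeps the file consistent with the DataFrame built on the very first run.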
