Reputation: 15
I have the 2 CSV files as shown down and the results, I want to write code in XXX place that will delete any duplicate rows in results but to keep the row with maximum column values, for example, I want to keep the name a with ADA, not NaN row. my code down there
order phone
name
a aa 1
b bb 2
c cc 3
order phone ad
name
a aa 1 ada
b bb 2 adb
c cc 3 adc
d dd 4 add
order phone ad
name
a aa 1 NaN
b bb 2 NaN
c cc 3 NaN
a aa 1 ada
b bb 2 adb
c cc 3 adc
d dd 4 add
import csv
import os
import pandas as pd
df1 = pd.read_csv(r"1.csv",
index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"2.csv",
index_col=[0], parse_dates=[0])
print(df1)
print(df2)
finaldf = pd.concat([df1, df2], axis=0, join='outer', sort=False)
df = finaldf.loc[:,~finaldf.columns.duplicated()]
XXXX
print(finaldf)
print(df)
export_csv = df.to_csv(r'3.csv', index = None, header=True)
Upvotes: 0
Views: 534
Reputation: 2870
You can sort the values of the column which has NAs to move NAs to the bottom of the df.
Then using drop_duplicates will keep the first and drop rows which has NAs because they are located last at the bottom.
df.sort_values(by="ad", na_position='last', inplace=True)
df.drop_duplicates(subset=["order", "phone"], keep="first", inplace=True)
Another way of doing it properly without getting NAs.
df = pd.merge(df1, df2, how='outer').set_index(df1.index)
Upvotes: 1