Xinxi Haj
Xinxi Haj

Reputation: 15

Python Pandas remove the duplicate rows and keep the row with more values in CSV

I have the 2 CSV files as shown down and the results, I want to write code in XXX place that will delete any duplicate rows in results but to keep the row with maximum column values, for example, I want to keep the name a with ADA, not NaN row. my code down there

    order  phone
name             
a       aa      1
b       bb      2
c       cc      3
     order  phone   ad
name                  
a       aa      1  ada
b       bb      2  adb
c       cc      3  adc
d       dd      4  add
     order  phone   ad
name                  
a       aa      1  NaN
b       bb      2  NaN
c       cc      3  NaN
a       aa      1  ada
b       bb      2  adb
c       cc      3  adc
d       dd      4  add   


 import csv
import os
import pandas as pd

df1 = pd.read_csv(r"1.csv",
                  index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"2.csv",
                  index_col=[0], parse_dates=[0])

print(df1)
print(df2)
finaldf = pd.concat([df1, df2], axis=0, join='outer', sort=False)
df = finaldf.loc[:,~finaldf.columns.duplicated()]
XXXX
print(finaldf)
print(df)

export_csv = df.to_csv(r'3.csv', index = None, header=True)

Upvotes: 0

Views: 534

Answers (1)

Natheer Alabsi
Natheer Alabsi

Reputation: 2870

You can sort the values of the column which has NAs to move NAs to the bottom of the df.
Then using drop_duplicates will keep the first and drop rows which has NAs because they are located last at the bottom.

df.sort_values(by="ad", na_position='last', inplace=True)
df.drop_duplicates(subset=["order", "phone"], keep="first", inplace=True)

Another way of doing it properly without getting NAs.

df = pd.merge(df1, df2, how='outer').set_index(df1.index)

Upvotes: 1

Related Questions