Reputation: 1617
Looking for a pythonic way of finding duplicate values in a text file.
1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01
Trying to find duplicates in the id column and keep the most recent record. Would I just loop over the rows, insert the ids into a list, and then check whether each value is already in the list?
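A minimal plain-Python sketch of that idea, assuming the rows live in a file called sample.txt and the last ||-separated field is the date:

from datetime import datetime

latest = {}  # id -> (parsed date, original line)
with open("sample.txt") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("||")
        rec_id = fields[0]
        rec_date = datetime.strptime(fields[-1], "%Y-%m-%d")
        # keep a line only if it is newer than the one already seen for this id
        if rec_id not in latest or rec_date > latest[rec_id][0]:
            latest[rec_id] = (rec_date, line)

for _, line in latest.values():
    print(line)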
Upvotes: 0
Views: 1750
Reputation: 193
import pandas as pd
f= open("sample.txt","w+")
f.write("1||mike||jones||38||first street||2018-05-01\n2||michale||jones||38||8th street||2018-05-01\n3||mich||jones||38||9th street||2018-05-01\n4||mitchel||jones||38||10th street||2018-05-01\n1||mike||jones||38||first street||2018-12-01")
f.close()
#read the delimited file, parsing the date field as datetime64 so it can be ranked chronologically
tbl = pd.read_csv("sample.txt", sep=r'\|\|', engine='python',
                  names=("id", "firstName", "lastName", "age", "address", "applicationDate"),
                  dtype={"id": int, "age": int}, parse_dates=["applicationDate"])
#Note-
#Records with ID=2,3,4 are distinct based on address;
#only the record with id=1 is a duplicate. Hence the source system takes care of identifying duplicate registrations.
#So we only need to identify duplicates based on ID and keep the most recent record based on application date (no need to re-implement any duplicate-identification logic).
for id in set(tbl["id"]):
    #create a temporary dataFrame of the rows with the given id, ranked by value in each field
    tempRankDF = tbl.loc[tbl["id"] == id].rank(ascending=False)
    #Note- the rank function calculates the rank of each field based on its dataType.
    #Hence the field "applicationDate" is parsed as datetime64,
    #so that when we rank "applicationDate" in descending order the most recent record gets rank==1
    #Get the index of the most recent record wrt the original dataFrame
    recentRowIndex = tempRankDF.loc[tempRankDF["applicationDate"] == 1].index[0]
    print(tbl.loc[recentRowIndex])
#Note: Update the code inside the for loop as per your convenience to write the final resultset to a file, another dataFrame, or a database.
#You can directly execute this code and check the resultset.
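For comparison, the per-id loop above could also be collapsed into a single groupby/idxmax lookup. This is only a sketch against the tbl frame built above, and it assumes applicationDate was parsed as a datetime column:

# for each id, select the row whose applicationDate is the maximum
recent = tbl.loc[tbl.groupby("id")["applicationDate"].idxmax()]
print(recent)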
Upvotes: 1
Reputation: 474
Pandas is a very powerful library for performing analytical operations with a minimum of code. It is an open-source Python package that provides numerous tools for data analysis.
Implementation of the case you want to achieve using pandas:
First install pandas: pip install pandas
i/p > A text file with input data in the given format
o/p > A text file with the required output in CSV format
import pandas as pd

headers = ["id", "first_name", "last_name", "age", "address", "date"]
with open("input") as file:  # Read input
    # Read the ||-delimited data into a data frame, parsing the date column
    data_frame = pd.read_csv(file, sep='[|][|]', names=headers, header=None,
                             parse_dates=['date'], engine="python")
data_frame.sort_values(data_frame.date.name, ascending=False, inplace=True)  # Sort the rows by date, newest first
data_frame.drop_duplicates(subset=data_frame.id.name, inplace=True)  # Drop duplicate rows based on id (keeps the first, i.e. most recent)
data_frame.to_csv('output', sep=',', header=None)  # Generate output
Upvotes: 1