Reputation: 1617
Looking for a pythonic way of finding duplicate values in a text file.
1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01
Trying to find duplicates in the id column and keep the most recent record. Would I just loop over the rows, insert the ids into a list, and then check whether each value is already in the list?
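A minimal plain-Python sketch of that idea, assuming the rows live in a file called sample.txt and the last ||-separated field is the date:

from datetime import datetime

latest = {}  # id -> (parsed date, original line)
with open("sample.txt") as f:
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("||")
        rec_id = fields[0]
        rec_date = datetime.strptime(fields[-1], "%Y-%m-%d")
        # keep a line only if it is newer than the one already seen for this id
        if rec_id not in latest or rec_date > latest[rec_id][0]:
            latest[rec_id] = (rec_date, line)

for _, line in latest.values():
    print(line)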
Upvotes: 0
Views: 1750
Reputation: 193
import pandas as pd
f= open("sample.txt","w+")
f.write("1||mike||jones||38||first street||2018-05-01\n2||michale||jones||38||8th street||2018-05-01\n3||mich||jones||38||9th street||2018-05-01\n4||mitchel||jones||38||10th street||2018-05-01\n1||mike||jones||38||first street||2018-12-01")
f.close()
#read the delimited file, parsing the date field as datetime64 so it can be ranked chronologically
tbl = pd.read_csv("sample.txt", sep=r'\|\|', engine='python',
                  names=("id", "firstName", "lastName", "age", "address", "applicationDate"),
                  dtype={"id": int, "age": int}, parse_dates=["applicationDate"])
#Note-
#Records with ID=2,3,4 are distinct based on address;
#only the record with id=1 is a duplicate. Hence the source system takes care of identifying duplicate registrations.
#So we only need to identify duplicates based on ID and keep the most recent record based on application date (no need to re-implement any duplicate-identification logic).
for id in set(tbl["id"]):
    #create a temporary dataFrame of the rows with the given id, ranked by value in each field
    tempRankDF = tbl.loc[tbl["id"] == id].rank(ascending=False)
    #Note- the rank function calculates the rank of each field based on its dataType.
    #Hence the field "applicationDate" is parsed as datetime64,
    #so that when we rank "applicationDate" in descending order the most recent record gets rank==1
    #Get the index of the most recent record wrt the original dataFrame
    recentRowIndex = tempRankDF.loc[tempRankDF["applicationDate"] == 1].index[0]
    print(tbl.loc[recentRowIndex])
#Note: Update the code inside the for loop as per your convenience to write the final resultset to a file, another dataFrame, or a database.
#You can directly execute this code and check the resultset.
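For comparison, the per-id loop above could also be collapsed into a single groupby/idxmax lookup. This is only a sketch against the tbl frame built above, and it assumes applicationDate was parsed as a datetime column:

# for each id, select the row whose applicationDate is the maximum
recent = tbl.loc[tbl.groupby("id")["applicationDate"].idxmax()]
print(recent)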
Upvotes: 1
Reputation: 474
Pandas is a very powerful library for performing analytical operations with a minimum of code. It is an open-source Python package that provides numerous tools for data analysis.
Implementation of the case you want to achieve using pandas:
First install pandas: pip install pandas
i/p > A text file with input data in the given format
o/p > A text file with the required output in CSV format
import pandas as pd

headers = ["id", "first_name", "last_name", "age", "address", "date"]
with open("input") as file:  # Read input
    # Read the ||-delimited data into a data frame, parsing the date column
    data_frame = pd.read_csv(file, sep='[|][|]', names=headers, header=None,
                             parse_dates=['date'], engine="python")
data_frame.sort_values(data_frame.date.name, ascending=False, inplace=True)  # Sort the rows by date, newest first
data_frame.drop_duplicates(subset=data_frame.id.name, inplace=True)  # Drop duplicate rows based on id (keeps the first, i.e. most recent)
data_frame.to_csv('output', sep=',', header=None)  # Generate output
Upvotes: 1