user3525290

Reputation: 1617

Find duplicate values in text file via python

Looking for pythonic way of finding duplicate values in a text file.

1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01

I'm trying to find duplicates in the id column and keep the most recent row. Would I just loop over the lines, insert the ids into a list, and then check whether each value is already in the list?
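The loop-and-check idea works, but a dict keyed on id is the usual pythonic shortcut: it replaces the list-membership test with an O(1) lookup and can hold the most recent row seen so far. A minimal sketch using the sample rows inline (file reading omitted):

```python
from datetime import date

# Sample rows in the question's "||"-delimited format.
lines = [
    "1||mike||jones||38||first street||2018-05-01",
    "2||michale||jones||38||8th street||2018-05-01",
    "3||mich||jones||38||9th street||2018-05-01",
    "4||mitchel||jones||38||10th street||2018-05-01",
    "1||mike||jones||38||first street||2018-12-01",
]

# Map each id to (date, row); replace a stored row only if the new one is newer.
latest = {}
for line in lines:
    fields = line.split("||")
    row_id, row_date = fields[0], date.fromisoformat(fields[-1])
    if row_id not in latest or row_date > latest[row_id][0]:
        latest[row_id] = (row_date, line)

for row_date, line in latest.values():
    print(line)
```

This prints one row per id, with id 1 carrying the 2018-12-01 date. To read from a file instead, iterate over `open("yourfile.txt")` and strip the newline before splitting.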

Upvotes: 0

Views: 1750

Answers (2)

MIKHIL NAGARALE

Reputation: 193

import pandas as pd

# Write the sample data to a file.
f = open("sample.txt", "w+")
f.write("1||mike||jones||38||first street||2018-05-01\n2||michale||jones||38||8th street||2018-05-01\n3||mich||jones||38||9th street||2018-05-01\n4||mitchel||jones||38||10th street||2018-05-01\n1||mike||jones||38||first street||2018-12-01")
f.close()

# Read the "||"-delimited file; parse_dates gives "applicationDate" the
# numpy.datetime64 dtype (read_csv does not accept datetime64 in its dtype
# mapping, and the two-character separator needs the python engine).
tbl = pd.read_csv(
    "sample.txt",
    sep=r"\|\|",
    engine="python",
    names=("id", "firstName", "lastName", "age", "address", "applicationDate"),
    parse_dates=["applicationDate"],
)


#Note-
#Records with ID=2,3,4 are distinct based on address.
#Only the record with id=1 is a duplicate, so the source system is taking care of identifying duplicate registrations.
#We therefore only need to identify duplicates based on ID and keep the most recent record based on application date (no need to re-implement any duplicate-identification logic).


for row_id in set(tbl["id"]):
    #create a temporary DataFrame for the rows with this id and rank each field
    tempRankDF = tbl.loc[tbl["id"] == row_id].rank(ascending=False)

    #Note- the rank function computes a rank for each field based on its dtype.
    #Hence "applicationDate" is read as numpy.datetime64 above,
    #so ranking it in descending order gives the most recent record rank==1

    #Get the index of the most recent record w.r.t. the original DataFrame
    recentRowIndex = tempRankDF.loc[tempRankDF["applicationDate"] == 1].index[0]

    print(tbl.loc[recentRowIndex])


#Note: Update the code inside the for loop as needed to write the final result set to a file, another DataFrame, or a database.
#You can execute this code directly and check the result set.
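For reference, the per-id loop above can be collapsed into a single `groupby`/`idxmax` call, which picks the row label of the latest applicationDate within each id group. A sketch using an in-memory copy of the same sample data:

```python
import io

import pandas as pd

data = """1||mike||jones||38||first street||2018-05-01
2||michale||jones||38||8th street||2018-05-01
3||mich||jones||38||9th street||2018-05-01
4||mitchel||jones||38||10th street||2018-05-01
1||mike||jones||38||first street||2018-12-01"""

tbl = pd.read_csv(
    io.StringIO(data),
    sep=r"\|\|",
    engine="python",
    names=("id", "firstName", "lastName", "age", "address", "applicationDate"),
    parse_dates=["applicationDate"],
)

# For each id, idxmax returns the row label with the latest applicationDate;
# selecting those labels keeps exactly one (most recent) row per id.
recent = tbl.loc[tbl.groupby("id")["applicationDate"].idxmax()]
print(recent)
```

This yields four rows, with id 1 keeping its 2018-12-01 record.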

Upvotes: 1

Kaushal Pahwani

Reputation: 474

Pandas is a very powerful library for performing analytical operations with a minimum of code.

Basically, pandas is an open-source Python package that provides numerous tools for data analysis. Some basic advantages and uses of pandas are listed below:

  1. It can present data in a way that is suitable for data analysis.
  2. The package contains multiple methods for convenient data filtering.
  3. It has a variety of utilities to perform Input/Output operations.

Implementation of the case you want to achieve using pandas:

First install pandas using pip install pandas

i/p > A text file with input data in given format

o/p > A text file with required output in csv format


import pandas as pd

with open("input") as file:     # Read input
    headers = ["id", "first_name", "last_name", "age", "address", "date"]
    data_frame = pd.read_csv(file, sep='[|][|]', names=headers, header=None, parse_dates=['date'],
                             engine="python")   # Read data into data frame from csv
    data_frame.sort_values(data_frame.date.name, ascending=False, inplace=True)     # Sort the values based on dates
    data_frame.drop_duplicates(subset=data_frame.id.name, inplace=True)     # Delete duplicate rows based on id
    data_frame.to_csv('output', sep=',', header=None)   # Generate output

Upvotes: 1
