asher baig

Reputation: 39

Duplication Algorithm in Java

I am looking for a duplicate-matching algorithm in Java. My scenario is as follows:

I have two tables. Table 1 contains 25,000 records (strings in a single column), and Table 2 similarly contains 20,000 records. I want to find the records that appear in both Table 1 and Table 2. The records look like this, for example:

Table 1

Jhon,voltra

Bruce willis

Table 2

voltra jhon

bruce, willis

I am looking for an algorithm that can find this kind of duplicate string match between the two tables, which are stored in two different files. Can someone suggest two or more algorithms that can perform such matching in Java?

Upvotes: 1

Views: 1348

Answers (2)

Peter Lawrey

Reputation: 533530

Read the two files into a normalised form so they can be compared. Put the entries from each file into a Set and use retainAll() to find the intersection of the two sets. These are the duplicates.
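A minimal sketch of this idea follows. The normalisation rule is an assumption based on the sample records in the question (lower-case everything, split on commas and whitespace, sort the tokens), and the names DuplicateFinder, normalise, normaliseAll and duplicates are made up for illustration. A TreeSet is used only so the printed order is predictable; a HashSet would work the same way with retainAll().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class DuplicateFinder {

    // Assumed normalisation: lower-case, split on commas/whitespace,
    // sort the tokens and join them with single spaces.
    static String normalise(String record) {
        String[] tokens = record.trim().toLowerCase(Locale.ROOT).split("[,\\s]+");
        Arrays.sort(tokens);
        return String.join(" ", tokens).trim();
    }

    // Normalise every record of one table into a set, skipping blank lines.
    static Set<String> normaliseAll(List<String> records) {
        Set<String> result = new TreeSet<>();
        for (String record : records) {
            String key = normalise(record);
            if (!key.isEmpty()) {
                result.add(key);
            }
        }
        return result;
    }

    // Intersection of the two normalised sets: records present in both tables.
    static Set<String> duplicates(List<String> table1, List<String> table2) {
        Set<String> common = normaliseAll(table1);
        common.retainAll(normaliseAll(table2));
        return common;
    }

    public static void main(String[] args) {
        List<String> table1 = Arrays.asList("Jhon,voltra", "Bruce willis");
        List<String> table2 = Arrays.asList("voltra jhon", "bruce, willis");
        System.out.println(duplicates(table1, table2)); // [bruce willis, jhon voltra]
    }
}
```

For the real files, the two lists could come from java.nio.file.Files.readAllLines(...). With 25,000 and 20,000 records both sets fit comfortably in memory.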

Upvotes: 5

Olaf Dietsche

Reputation: 74058

You can use a Map<String, Integer> (e.g. HashMap) and read the files line by line and insert the strings into the map, incrementing the value each time you find an existing entry.

You can then search through your map and find all entries with a count > 1.
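A rough sketch of the counting approach is below. It assumes the strings are normalised before they go into the map (lower-cased, split on commas and whitespace, tokens sorted), since otherwise "Jhon,voltra" and "voltra jhon" would never land on the same key. The class and method names are hypothetical, and a TreeMap stands in for the HashMap only to keep the output order stable. Note that a record repeated inside a single file also ends up with a count > 1, so this reports duplicates within one table as well as across the two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class DuplicateCounter {

    // Assumed normalisation: lower-case, split on commas/whitespace,
    // sort the tokens and join them with single spaces.
    static String normalise(String record) {
        String[] tokens = record.trim().toLowerCase(Locale.ROOT).split("[,\\s]+");
        Arrays.sort(tokens);
        return String.join(" ", tokens).trim();
    }

    // Count how often each normalised record occurs across all given lines.
    static Map<String, Integer> countRecords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String key = normalise(line);
            if (!key.isEmpty()) {
                counts.merge(key, 1, Integer::sum); // insert 1 or increment
            }
        }
        return counts;
    }

    // Collect every key whose count is greater than 1.
    static List<String> repeated(Map<String, Integer> counts) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Lines of both files, read one after the other into the same map.
        List<String> allLines = Arrays.asList(
                "Jhon,voltra", "Bruce willis",   // table 1
                "voltra jhon", "bruce, willis"); // table 2
        System.out.println(repeated(countRecords(allLines))); // [bruce willis, jhon voltra]
    }
}
```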

Upvotes: 0
