Reputation: 39
I am looking for a duplicate-matching algorithm in Java. My scenario is as follows:
I have two tables. Table 1 contains 25,000 string records in a single column, and Table 2 similarly contains 20,000 string records. I want to find the records that appear in both Table 1 and Table 2. The records look like this, for example:
Table 1
Jhon,voltra
Bruce willis
Table 2
voltra jhon
bruce, willis
I am looking for an algorithm that can find this kind of duplicate string match between the two tables, which are stored in two different files. Can someone suggest two or more algorithms that can do this in Java?
Upvotes: 1
Views: 1348
Reputation: 533530
Read the two files into a normalised form so they can be compared. Put the normalised entries from each file into a Set and use retainAll() to find the intersection of the two sets. These are the duplicates.
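A minimal sketch of this approach, under some assumptions: the class name DuplicateFinder and the normalise() rule (lower-case, split on anything that is not a letter or digit, sort the tokens) are hypothetical and were picked only because they make the sample records in the question match. The two lists stand in for the lines of the two files, which could be loaded with something like Files.readAllLines.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class DuplicateFinder {

    // Assumed normalisation rule: lower-case the record, split it on any run of
    // characters that are not letters or digits, sort the tokens, and re-join
    // them with single spaces. "Jhon,voltra" and "voltra jhon" both become
    // "jhon voltra".
    static String normalise(String record) {
        return Arrays.stream(record.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .sorted()
                .collect(Collectors.joining(" "));
    }

    // Normalise every record and collect the results into a sorted set.
    static Set<String> normaliseAll(Collection<String> records) {
        return records.stream()
                .map(DuplicateFinder::normalise)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Intersection of the two normalised sets; retainAll() keeps only the
    // entries that are also present in the second set.
    static Set<String> duplicates(List<String> table1, List<String> table2) {
        Set<String> common = normaliseAll(table1);
        common.retainAll(normaliseAll(table2));
        return common;
    }

    public static void main(String[] args) {
        List<String> table1 = Arrays.asList("Jhon,voltra", "Bruce willis");
        List<String> table2 = Arrays.asList("voltra jhon", "bruce, willis");
        System.out.println(duplicates(table1, table2)); // prints [bruce willis, jhon voltra]
    }
}
```

The result contains the normalised keys rather than the original spellings. If the original records are needed, a Map from normalised key to original string could be kept alongside the sets.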
Upvotes: 5
Reputation: 74058
You can use a Map<String, Integer> (e.g. HashMap), read the files line by line, and insert each string into the map, incrementing the value each time you find an existing entry.
You can then search through your map and find all entries with a count > 1.
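A rough sketch of the counting approach, with assumptions stated up front: the class name DuplicateCounter and the normalise() rule (lower-case, split on non-alphanumerics, sort tokens) are hypothetical and are only there so that the sample records from the question compare equal. The lists stand in for the lines read from the two files.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

final class DuplicateCounter {

    // Assumed normalisation rule so that "Jhon,voltra" and "voltra jhon"
    // map to the same key ("jhon voltra").
    static String normalise(String record) {
        return Arrays.stream(record.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .sorted()
                .collect(Collectors.joining(" "));
    }

    // Count how often each normalised record occurs across both tables.
    static Map<String, Integer> countRecords(List<String> table1, List<String> table2) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> table : Arrays.asList(table1, table2)) {
            for (String record : table) {
                // merge() inserts 1 for a new key, otherwise adds 1 to the old value.
                counts.merge(normalise(record), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keys that were seen more than once, in sorted order.
    static List<String> repeated(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .filter(entry -> entry.getValue() > 1)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> table1 = Arrays.asList("Jhon,voltra", "Bruce willis");
        List<String> table2 = Arrays.asList("voltra jhon", "bruce, willis");
        System.out.println(repeated(countRecords(table1, table2))); // prints [bruce willis, jhon voltra]
    }
}
```

One caveat with this sketch: a record that appears twice within the same file also ends up with a count above 1, so the result is "seen more than once anywhere" rather than strictly "present in both files".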
Upvotes: 0