Arnab Bhagabati

Reputation: 2725

Generic method to remove duplicates from any collection of data

I have to remove duplicate rows from random collections of data.
The data is retrieved from the database as a JSONArray object by an API and can contain any number/type of columns.

For example, in one case it could look like:

TimeStamp              Id   Value   Person
---------             ----  -----   ------
01-01-2018 14:22:16    12    0.3     Mac
01-01-2018 14:32:16    11    0.1     Arya
01-01-2018 14:32:16     2    0.8     Stephen
01-01-2018 14:32:16    11    0.1     Arya

In another case it might be something like:

Department     Floor    Budget
---------      -----   --------
  Sales         26      100000
 Finance        25      200000 
 Finance        25      200000
Production       1      9000000

I need to write a generic method that can remove duplicates from any such collection.

Fortunately, in most cases, the duplicate rows can be identified by 2 or 3 columns. (For example, in the first data set, if TimeStamp and Id are the same, the row can be considered a duplicate.)


Right now, I can think of only a naive way of doing this:

Write a method that

  1. Takes the input data and the list of columns that identify a duplicate row as arguments
  2. For each row, builds a string that is the concatenation of the duplicate-identifying columns
  3. Keeps adding these strings to a set
  4. For any row, if its string is already in the set (duplicate row), skips it; otherwise adds the row to the output array

Basically, write a method like:

public JSONArray removeDuplicateRows(JSONArray input, List<String> dupCriteriaColumns){

    JSONArray opArray = new JSONArray();
    HashSet<String> uniqueRows = new HashSet<String>();

    for (int i = 0; i < input.length(); i++) {
        StringBuilder uniqueColString = new StringBuilder();
        JSONObject currRow = input.getJSONObject(i);

        // Build the composite key; the separator keeps "ab"+"c" from
        // colliding with "a"+"bc"
        for (String column : dupCriteriaColumns){
            uniqueColString.append(currRow.getString(column)).append('\u0001');
        }

        String rowKey = uniqueColString.toString();
        if(uniqueRows.contains(rowKey)){
            // Duplicate row
            continue;
        }else{
            opArray.put(currRow);  // org.json appends via put()
            uniqueRows.add(rowKey);
        }
    }
    return opArray;
}


Of course, if there are too many "unique row identifier columns", I will end up with large String objects and this solution will suck.
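One variant I can think of (a rough sketch only, using the same org.json types as above) is to key the set on the list of raw column values instead of a concatenated String. ArrayList already has value-based equals()/hashCode(), so no String building or separator is needed, though it still stores one key per row:

public JSONArray removeDuplicateRowsByKeyList(JSONArray input, List<String> dupCriteriaColumns){

    JSONArray output = new JSONArray();
    // Each key is the list of raw column values; List.equals()/hashCode()
    // compare element by element, so no concatenation is required
    HashSet<List<Object>> seenKeys = new HashSet<List<Object>>();

    for (int i = 0; i < input.length(); i++) {
        JSONObject currRow = input.getJSONObject(i);
        List<Object> key = new ArrayList<Object>();
        for (String column : dupCriteriaColumns){
            key.add(currRow.get(column));
        }
        // Set.add() returns false when the key was already seen
        if (seenKeys.add(key)) {
            output.put(currRow);
        }
    }
    return output;
}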

EDIT 1: I would prefer not to create a POJO for each data set, because there are too many types of data sets, their structure changes frequently, and new types get added/removed very often.

Can anyone suggest a better idea/logic?

Upvotes: 0

Views: 250

Answers (2)

Florin M

Reputation: 189

If you control the data retrieval, then I suggest removing the duplicates at the DB level.
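For example, something along these lines (a sketch only; the table/column names and connection parameters are made up, not your actual schema), using SELECT DISTINCT so the database drops the duplicates before they ever reach the application:

import java.sql.*;
import org.json.*;

// Sketch: the table/column names below are hypothetical
public JSONArray fetchDistinctBudgets(String jdbcUrl, String user, String password) throws SQLException {
    String sql = "SELECT DISTINCT department, floor, budget FROM department_budgets";
    JSONArray rows = new JSONArray();
    try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
        while (rs.next()) {
            JSONObject row = new JSONObject();
            row.put("Department", rs.getString("department"));
            row.put("Floor", rs.getInt("floor"));
            row.put("Budget", rs.getLong("budget"));
            rows.put(row);  // rows arrive already deduplicated
        }
    }
    return rows;
}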

If not, I suggest writing specific methods for each type of data that you have, i.e. transforming the JSONArray to a specific type and handling the duplicates there.

Upvotes: 0

metodski

Reputation: 555

I would go through my entities and add proper equals() and hashCode() logic (based on the fields I want to be unique).

Then, when retrieving the entities, I would just retrieve them into a Set.

That will effectively remove the duplicates, since the Set uses equals() and hashCode() to keep only unique objects.
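A minimal sketch of the idea (the Reading entity and its fields are made up to match the first example table; per the question, only TimeStamp and Id decide what counts as a duplicate):

import java.util.*;

public final class Reading {
    private final String timeStamp;
    private final int id;
    private final double value;
    private final String person;

    public Reading(String timeStamp, int id, double value, String person) {
        this.timeStamp = timeStamp;
        this.id = id;
        this.value = value;
        this.person = person;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Reading)) return false;
        Reading other = (Reading) o;
        // Only the duplicate-identifying fields take part in equality
        return id == other.id && Objects.equals(timeStamp, other.timeStamp);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals(): same fields only
        return Objects.hash(timeStamp, id);
    }
}

Then collecting into a LinkedHashSet keeps the first occurrence of each duplicate group and preserves the original row order:

Set<Reading> unique = new LinkedHashSet<>(readings);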

Upvotes: 4
