Arnab Bhagabati

Reputation: 2725

Generic method to remove duplicates from any collection of data

I have to remove duplicate rows from random collections of data.
The data is retrieved from the database as a JSONArray object by an API and can contain any number/type of columns.

For example, in one case it could look like:

TimeStamp              Id   Value   Person
---------             ----  -----   ------
01-01-2018 14:22:16    12    0.3     Mac
01-01-2018 14:32:16    11    0.1     Arya
01-01-2018 14:32:16     2    0.8     Stephen
01-01-2018 14:32:16    11    0.1     Arya

In another case it might be something like:

Department     Floor    Budget
---------      -----   --------
  Sales         26      100000
 Finance        25      200000 
 Finance        25      200000
Production       1      9000000

I need to write a generic method that can remove duplicates from any such collection.

Fortunately, in most cases, the duplicate rows can be identified by 2 or 3 columns. (For example, in the first data set, if TimeStamp and Id are the same, the row can be considered a duplicate.)


Right now, I can think of only a naive way of doing this:

Write a method that

  1. Takes the input data and the list of columns that identify a duplicate row as arguments
  2. For each row, builds a string that is the concatenation of the duplicate-identifying columns
  3. Keeps adding these strings to a set
  4. For any row, if its string is already in the set (duplicate row), skips it; otherwise adds the row to the output array

Basically, write a method like:

public JSONArray removeDuplicateRows(JSONArray input, List<String> dupCriteriaColumns){

    JSONArray opArray = new JSONArray();
    HashSet<String> uniqueRows = new HashSet<String>();

    for (int i = 0; i < input.length(); i++) {
        StringBuilder uniqueColString = new StringBuilder();
        JSONObject currRow = input.getJSONObject(i);

        // Build the composite key; the separator keeps "ab"+"c" from
        // colliding with "a"+"bc"
        for (String column : dupCriteriaColumns){
            uniqueColString.append(currRow.getString(column)).append('\u0001');
        }

        String rowKey = uniqueColString.toString();
        if(uniqueRows.contains(rowKey)){
            // Duplicate row
            continue;
        }else{
            opArray.put(currRow);  // org.json appends via put()
            uniqueRows.add(rowKey);
        }
    }
    return opArray;
}


Of course, if there are too many "unique row identifier columns", I will end up with large String objects and this solution will suck.
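One variant I can think of (a rough sketch only, using the same org.json types as above) is to key the set on the list of raw column values instead of a concatenated String. ArrayList already has value-based equals()/hashCode(), so no String building or separator is needed, though it still stores one key per row:

public JSONArray removeDuplicateRowsByKeyList(JSONArray input, List<String> dupCriteriaColumns){

    JSONArray output = new JSONArray();
    // Each key is the list of raw column values; List.equals()/hashCode()
    // compare element by element, so no concatenation is required
    HashSet<List<Object>> seenKeys = new HashSet<List<Object>>();

    for (int i = 0; i < input.length(); i++) {
        JSONObject currRow = input.getJSONObject(i);
        List<Object> key = new ArrayList<Object>();
        for (String column : dupCriteriaColumns){
            key.add(currRow.get(column));
        }
        // Set.add() returns false when the key was already seen
        if (seenKeys.add(key)) {
            output.put(currRow);
        }
    }
    return output;
}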

EDIT 1: I would prefer not to create a POJO for each data set, because there are too many types of data sets, their structure changes frequently, and new types get added/removed very often.

Can anyone suggest a better idea/logic?

Upvotes: 0

Views: 250

Answers (2)

Florin M

Reputation: 189

If you control the data retrieval, then I suggest removing the duplicates at the DB level.
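For example, something along these lines (a sketch only; the table/column names and connection parameters are made up, not your actual schema), using SELECT DISTINCT so the database drops the duplicates before they ever reach the application:

import java.sql.*;
import org.json.*;

// Sketch: the table/column names below are hypothetical
public JSONArray fetchDistinctBudgets(String jdbcUrl, String user, String password) throws SQLException {
    String sql = "SELECT DISTINCT department, floor, budget FROM department_budgets";
    JSONArray rows = new JSONArray();
    try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
        while (rs.next()) {
            JSONObject row = new JSONObject();
            row.put("Department", rs.getString("department"));
            row.put("Floor", rs.getInt("floor"));
            row.put("Budget", rs.getLong("budget"));
            rows.put(row);  // rows arrive already deduplicated
        }
    }
    return rows;
}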

If not, I suggest writing specific methods for each type of data that you have, i.e. transforming the JSONArray to a specific type and handling the duplicates there.

Upvotes: 0

metodski

Reputation: 555

I would go through my entities and add proper equals() and hashCode() logic (based on the fields I want to be unique).

Then, when retrieving the entities, I would just retrieve them into a Set.

That will effectively remove the duplicates, since the Set uses equals() and hashCode() to keep only unique objects.
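A minimal sketch of the idea (the Reading entity and its fields are made up to match the first example table; per the question, only TimeStamp and Id decide what counts as a duplicate):

import java.util.*;

public final class Reading {
    private final String timeStamp;
    private final int id;
    private final double value;
    private final String person;

    public Reading(String timeStamp, int id, double value, String person) {
        this.timeStamp = timeStamp;
        this.id = id;
        this.value = value;
        this.person = person;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Reading)) return false;
        Reading other = (Reading) o;
        // Only the duplicate-identifying fields take part in equality
        return id == other.id && Objects.equals(timeStamp, other.timeStamp);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals(): same fields only
        return Objects.hash(timeStamp, id);
    }
}

Then collecting into a LinkedHashSet keeps the first occurrence of each duplicate group and preserves the original row order:

Set<Reading> unique = new LinkedHashSet<>(readings);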

Upvotes: 4
