Reputation: 2725
I have to remove duplicate rows from random collections of data.
The data is retrieved from the database as a JSONArray object by an API and can have any number and type of columns.
For example while in one case it could be like:
TimeStamp            Id  Value  Person
-------------------  --  -----  -------
01-01-2018 14:22:16  12  0.3    Mac
01-01-2018 14:32:16  11  0.1    Arya
01-01-2018 14:32:16  2   0.8    Stephen
01-01-2018 14:32:16  11  0.1    Arya
In another case it might be something like:
Department  Floor  Budget
----------  -----  -------
Sales       26     100000
Finance     25     200000
Finance     25     200000
Production  1      9000000
I need to write a generic method that can remove duplicates from any such collection.
Fortunately, in most cases, the duplicate rows can be identified by 2 or 3 columns. (For example, in the first data set, if TimeStamp and Id are the same, the row can be considered a duplicate.)
Right now, I can think of only the naive way of doing this. Basically, write a method like:
public JSONArray removeDuplicateRows(JSONArray input, List<String> dupCriteriaColumns) {
    JSONArray opArray = new JSONArray();
    HashSet<String> uniqueRows = new HashSet<String>();
    for (int i = 0; i < input.length(); i++) {
        StringBuilder uniqueColString = new StringBuilder();
        JSONObject currRow = input.getJSONObject(i);
        for (String column : dupCriteriaColumns) {
            // Append a separator so adjacent values can't run together
            // and produce false matches ("ab" + "c" vs. "a" + "bc").
            uniqueColString.append(currRow.getString(column)).append('|');
        }
        String key = uniqueColString.toString();
        if (uniqueRows.contains(key)) {
            // Duplicate row; skip it.
            continue;
        } else {
            opArray.put(currRow);
            uniqueRows.add(key);
        }
    }
    return opArray;
}
Of course, if the number of "unique row identifier" columns is too large, I will end up with large String objects and this solution will suck.
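One variation that avoids building long Strings (just a sketch; the method name below is made up) would be to use the list of column values itself as the set key, since java.util.List compares elements by value in its equals() and hashCode():

public JSONArray removeDuplicateRowsByKeyList(JSONArray input, List<String> dupCriteriaColumns) {
    JSONArray output = new JSONArray();
    Set<List<Object>> seenKeys = new HashSet<List<Object>>();
    for (int i = 0; i < input.length(); i++) {
        JSONObject currRow = input.getJSONObject(i);
        List<Object> key = new ArrayList<Object>();
        for (String column : dupCriteriaColumns) {
            key.add(currRow.get(column)); // keep original value types; no String building
        }
        if (seenKeys.add(key)) { // add() returns false if the key was already present
            output.put(currRow);
        }
    }
    return output;
}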
EDIT 1: I would prefer not to create a POJO for each data set, because there are too many types of data sets, their structure changes frequently, and new types get added/removed very often.
Can anyone suggest a better idea/logic?
Upvotes: 0
Views: 250
Reputation: 189
If you control the data retrieval, then I suggest removing duplicates at the DB level.
If not, I suggest writing specific methods for each type of data that you have, i.e. transforming the JSONArray into a specific type and handling the duplicates there.
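As a rough sketch of the DB-level option (the connection details, table, and column names below are hypothetical), a plain SELECT DISTINCT makes the database drop rows that are identical across the selected columns:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DbDedupExample {
    public static void main(String[] args) throws SQLException {
        String jdbcUrl = "jdbc:postgresql://localhost/mydb"; // hypothetical
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT DISTINCT time_stamp, id, value, person FROM readings")) {
            while (rs.next()) {
                // Build the JSONArray from rows the database has already deduplicated.
                System.out.println(rs.getString("time_stamp") + " " + rs.getInt("id"));
            }
        }
    }
}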
Upvotes: 0
Reputation: 555
I would go through my entities and add proper equals() and hashCode() logic (based on the fields I want to be unique).
Then, when retrieving the entities, I would just retrieve them into a Set.
That will effectively remove the duplicates, as a Set uses equals() and hashCode() to keep only unique objects.
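For example, a sketch with a hypothetical Reading entity whose identity is defined by timeStamp and id:

import java.util.Objects;

public class Reading {
    private final String timeStamp;
    private final int id;
    private final double value;
    private final String person;

    public Reading(String timeStamp, int id, double value, String person) {
        this.timeStamp = timeStamp;
        this.id = id;
        this.value = value;
        this.person = person;
    }

    // Two readings are "the same row" when timeStamp and id match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Reading)) return false;
        Reading other = (Reading) o;
        return id == other.id && timeStamp.equals(other.timeStamp);
    }

    @Override
    public int hashCode() {
        return Objects.hash(timeStamp, id);
    }
}

Retrieving into a set then drops the duplicates automatically (a LinkedHashSet also keeps insertion order):

Set<Reading> unique = new LinkedHashSet<>(listOfReadings);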
Upvotes: 4