Reputation: 53
I have a JSON file that contains 500k objects, and this is its format:
"WORKORDER": [
{
"Attributes": {
"SITEID": {
"content": "BEDFORD"
},
"WONUM": {
"content": "1000"
},
"WOPRIORITY": {
"content": 2
},
"WORKTYPE": {
"content": "CM"
}
}
},
{
"Attributes": {
"SITEID": {
"content": "BEDFORD"
},
"WONUM": {
"content": "1000-10"
},
"WORKTYPE": {
"content": "CM"
}
}
}
I'm getting the distinct values like this:
for (int i = 0; i < WORKORDER.length(); i++) {
    JSONObject obj = WORKORDER.getJSONObject(i);
    JSONObject att = obj.getJSONObject("Attributes");
    if (att.has(col)) { // col comes from the servlet request parameters
        JSONObject column = att.getJSONObject(col);
        Object colval = column.get("content");
        if (!list.contains(colval)) {
            out.println(colval);
            list.add(colval);
        }
    }
}
But it takes a long time, even for only 5000 objects!
Is there any way to get the distinct values of a column without parsing the whole JSON file, that is, by parsing only the column needed?
Upvotes: 1
Views: 7008
Reputation: 14572
You are iterating over a JSON array with 500k elements. For each element you check whether it was previously added to a List. List.contains is a linear scan, so your logic walks the list up to 500k times, and each walk gets longer as the list grows.
Instead, you should use a HashSet. First, a Set prevents duplicate values, so you just need to call set.add(value). More importantly, add and contains on a HashSet run in constant time on average: since it uses buckets to organize the values, it doesn't have to iterate over the whole Set.
You can read more about that in amit's answer to How can a HashSet offer constant time add operation?:
Note that HashSet gives amortized and average time performance of O(1), not worst case. This means, we can suffer an O(n) operation from time to time. So, when the bins are too packed up, we just create a new, bigger array, and copy the elements to it.
Note that to use any Hash* implementation (HashSet, HashMap, ...), you need to make sure the instances you store implement hashCode and equals correctly. You can find out more about this in the community post What issues should be considered when overriding equals and hashCode in Java?.
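As a minimal illustration of why hashCode and equals matter for a HashSet, here is a sketch using two hypothetical classes, RawKey and ProperKey (neither appears in the question's code). RawKey inherits the identity-based methods from Object, so two instances holding the same text count as different elements. ProperKey compares by content, so duplicates collapse as expected.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class HashContractDemo {

    // Hypothetical class: inherits identity-based equals/hashCode from Object.
    static final class RawKey {
        final String value;
        RawKey(String value) { this.value = value; }
    }

    // Hypothetical class: equality and hash code are based on the wrapped value.
    static final class ProperKey {
        final String value;
        ProperKey(String value) { this.value = value; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ProperKey)) return false;
            return Objects.equals(value, ((ProperKey) o).value);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(value);
        }
    }

    // Number of elements the set keeps when keys compare by identity.
    static int distinctRaw(String... values) {
        Set<RawKey> set = new HashSet<>();
        for (String v : values) set.add(new RawKey(v));
        return set.size();
    }

    // Number of elements the set keeps when keys compare by content.
    static int distinctProper(String... values) {
        Set<ProperKey> set = new HashSet<>();
        for (String v : values) set.add(new ProperKey(v));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctRaw("CM", "CM", "PM"));    // 3: no deduplication
        System.out.println(distinctProper("CM", "CM", "PM")); // 2: duplicates removed
    }
}
```

Standard types such as String, Integer and Long already implement both methods, so this mostly matters if you wrap values in your own classes.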
Now for the solution:
Set<Object> sets = new HashSet<>();
for (int i = 0; i < WORKORDER.length(); i++) {
    // ...
    Object colval = column.get("content");
    if (sets.add(colval)) { // `add` returns true if the value wasn't present already.
        out.println(colval);
    }
}
I kept the Object type, but this should be typed properly, at least to be sure that the stored instances implement those methods as needed. Since colval is an Object, it may not implement the required methods correctly, so I suggest you parse it properly: use column.getString("content") instead, or check the instance type.
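A small sketch of why the runtime type of colval matters: in a Set<Object>, the Integer 2, the Long 2 and the String "2" are three different elements, because equals returns false across those types. One option, assuming you want such values treated as the same column value (that depends on your data), is to convert everything to a String before adding it. The class and method names below are made up for illustration, and the sample values stand in for what column.get("content") might return.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NormalizeDemo {

    // Collects distinct values as they are, whatever their runtime type.
    static Set<Object> distinctRaw(List<Object> values) {
        return new LinkedHashSet<>(values);
    }

    // Collects distinct values after converting each one to its String form.
    // Assumption: values with the same text should count as the same value.
    static Set<String> distinctAsString(List<Object> values) {
        Set<String> set = new LinkedHashSet<>();
        for (Object v : values) {
            set.add(String.valueOf(v));
        }
        return set;
    }

    public static void main(String[] args) {
        // Stand-ins for mixed "content" values: Integer, Long, String.
        List<Object> values = Arrays.<Object>asList(2, 2L, "2", "CM", "CM");
        System.out.println(distinctRaw(values).size()); // 4
        System.out.println(distinctAsString(values));   // [2, CM]
    }
}
```

LinkedHashSet is used here only to keep the output in insertion order; a plain HashSet has the same average-case cost.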
To validate the difference between the List and the Set approach, I used a method that creates a fake JSON:
public static JSONObject createDummyJson(int items) {
    JSONObject json = new JSONObject();
    JSONArray orders = new JSONArray();
    json.put("order", orders);
    JSONObject attributes;
    JSONObject item;
    JSONObject order;
    Random rand = new Random();
    String[] columns = {"columnA", "columnB", "columnC", "columnD"};
    for (int i = 0; i < items; ++i) {
        order = new JSONObject();
        attributes = new JSONObject();
        order.put("Attributes", attributes);
        orders.put(order);
        for (int j = 0; j < rand.nextInt(1000) % columns.length; ++j) {
            item = new JSONObject();
            long rValue = rand.nextLong();
            item.put("content", j % 3 == 0 ? ("" + rValue) : rValue);
            attributes.put(columns[j], item);
        }
    }
    return json;
}
I then ran a basic benchmark for both methods and got the following results:
public static void main(String[] args) throws Exception {
    final int jsonLength = 500_000;
    JSONObject json = createDummyJson(jsonLength);
    long start = System.currentTimeMillis();
    List<Object> list = parseJson(json);
    long end = System.currentTimeMillis();
    System.out.format("List - Run in %d ms for %d items and output %d lines%n", end - start, jsonLength, list.size());
    start = System.currentTimeMillis();
    Set<Object> set = parseJsonSet(json);
    end = System.currentTimeMillis();
    System.out.format("Set - Run in %d ms for %d items and output %d lines%n", end - start, jsonLength, set.size());
}

public static List<Object> parseJson(JSONObject json) {
    String col = "columnC";
    JSONArray array = json.getJSONArray("order");
    List<Object> list = new ArrayList<>();
    for (int i = 0; i < array.length(); i++) {
        JSONObject obj = array.getJSONObject(i);
        JSONObject att = obj.getJSONObject("Attributes");
        if (att.has(col)) { // getting col from params in the servlet
            JSONObject column = att.getJSONObject(col);
            Object colval = column.get("content");
            if (!list.contains(colval)) {
                //System.out.println(colval);
                list.add(colval);
            }
        }
    }
    return list;
}

public static Set<Object> parseJsonSet(JSONObject json) {
    String col = "columnC";
    JSONArray array = json.getJSONArray("order");
    Set<Object> set = new HashSet<>();
    for (int i = 0; i < array.length(); i++) {
        JSONObject obj = array.getJSONObject(i);
        JSONObject att = obj.getJSONObject("Attributes");
        if (att.has(col)) { // getting col from params in the servlet
            JSONObject column = att.getJSONObject(col);
            Object colval = column.get("content");
            if (set.add(colval)) {
                //System.out.println(colval);
            }
        }
    }
    return set;
}
List - Run in 5993 ms for 500000 items and output 46971 lines
Set - Run in 62 ms for 500000 items and output 46971 lines
I even tried a JSON with 5M rows (I removed the List version, which would never finish):
Set - Run in 6436 ms for 5000000 items and output 468895 lines
Upvotes: 1