ankit3j

Reputation: 51

Java Streams: Is the complexity of collecting a stream of Long the same as filtering it based on Set::contains?

I have an application which accepts employee ids as user input and then filters the employee list for matching ids. The user input is typically 3-4 ids, while the employee list contains a few thousand entries.

With performance in mind, I came up with the following two methods using Stream filters.

Method1

The motivation here is to avoid running the filter for each employee and instead run it on the requested-ids list, which is guaranteed to be very short.

private static Set<Long> identifyEmployees(CustomRequest request) {
  List<Long> requestedIds = request.getRequestedIDs();                            
  if (!requestedIds.isEmpty()) {
      Set<Long> allEmployeeIds = 
              employeeInfoProvider
                .getEmployeeInfoList()  // returns List<EmployeeInfo>
                .stream()
                .map(EmployeeInfo::getEmpId)  // getEmpId() returns a Long
                .collect(Collectors.toSet());

      return requestedIds.stream().filter(allEmployeeIds::contains).collect(Collectors.toSet());         
  }
  return Collections.emptySet();
}

Method2

The motivation here is to replace the collect() of Method1 with a filter, since the complexity would be the same. The collect() here runs on only a very small number of elements (the requested ids).

private static Set<Long> identifyEmployees(CustomRequest request) {
  Set<Long> requestedIds = request.getRequestedIDs()   // returns List<Long>
                          .stream()
                          .collect(Collectors.toSet());
  if (!requestedIds.isEmpty()) {
      return employeeInfoProvider
               .getEmployeeInfoList()  // returns List<EmployeeInfo>
               .stream()
               .map(EmployeeInfo::getEmpId)  // getEmpId() returns a Long
               .filter(requestedIds::contains)
               .collect(Collectors.toSet());
  }
  return Collections.emptySet();
}

Does Method2 perform as well as Method1, or does Method1 perform better?

Upvotes: 1

Views: 348

Answers (2)

Nowhere Man

Reputation: 19545

A potentially faster (though not cleaner) option would be to return immediately once all the requestedIds have been found, but I'm not sure whether this can be implemented with the Stream API.

private static Set<Long> identifyEmployees(CustomRequest request) {
    Set<Long> requestedIds = request.getRequestedIDs()   // returns List<Long>
            .stream()
            .collect(Collectors.toSet());
    Set<Long> result = new HashSet<>();
    if (!requestedIds.isEmpty()) {
        Iterator<EmployeeInfo> employees = employeeInfoProvider.getEmployeeInfoList().iterator();
        while (result.size() < requestedIds.size() && employees.hasNext()) {
            Long employeeId = employees.next().getEmpId();
            if (requestedIds.contains(employeeId)) {
                result.add(employeeId);
            }
        }
    }
    return result;
}

However, it makes sense only if employeeInfoProvider.getEmployeeInfoList() can return multiple employees with the same ID. Otherwise, as mentioned above, Method2 is the better choice.
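For what it's worth, the early exit can be expressed with the Stream API: limit() is a short-circuiting operation, so once the requested number of distinct matches has been seen, the rest of the employee list is never traversed. A sketch, with plain Long values standing in for the question's EmployeeInfo objects (the class and method names here are illustrative, not from the question):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class EarlyExitFilter {
    static Set<Long> identifyEmployees(Set<Long> requestedIds, List<Long> allEmployeeIds) {
        if (requestedIds.isEmpty()) {
            return Set.of();
        }
        return allEmployeeIds.stream()
                .filter(requestedIds::contains)
                .distinct()                  // guard against duplicate employee ids
                .limit(requestedIds.size())  // short-circuits once every requested id is found
                .collect(Collectors.toSet());
    }
}
```

The distinct() step matters: without it, duplicate ids in the employee list would count toward the limit more than once and the stream could stop before all requested ids were found.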

Upvotes: 0

Louis Wasserman

Reputation: 198033

I would expect Method2 to perform as well as or better than Method1 in all scenarios.

Collecting all the employee ids into an intermediate set adds allocation overhead. It reduces the number of requestedIds::contains calls you have to do later if there are lots of duplicates, but even then you are just exchanging each Set::add call for a Set::contains call, and each such exchange should be a small win.
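To make the trade-off concrete, here are the two approaches side by side: Method1 performs N Set::add calls to build the big set plus K Set::contains calls, while Method2 performs K adds plus N contains calls and never allocates a set proportional to the employee list. This is a self-contained sketch with plain Long lists standing in for the question's CustomRequest and EmployeeInfo types (the class name is illustrative):

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class IdFilter {
    // Method1: collect all N employee ids into a Set, then filter the
    // short requested list against it (N adds + K contains calls).
    static Set<Long> method1(List<Long> requestedIds, List<Long> employeeIds) {
        if (requestedIds.isEmpty()) {
            return Collections.emptySet();
        }
        Set<Long> allEmployeeIds = employeeIds.stream().collect(Collectors.toSet());
        return requestedIds.stream()
                .filter(allEmployeeIds::contains)
                .collect(Collectors.toSet());
    }

    // Method2: collect the K requested ids into a Set, then filter the
    // employee list against it (K adds + N contains calls, no large
    // intermediate Set).
    static Set<Long> method2(List<Long> requestedIds, List<Long> employeeIds) {
        Set<Long> requested = requestedIds.stream().collect(Collectors.toSet());
        if (requested.isEmpty()) {
            return Collections.emptySet();
        }
        return employeeIds.stream()
                .filter(requested::contains)
                .collect(Collectors.toSet());
    }
}
```

Both methods return the same result for the same inputs; the difference is purely in allocation and in which collection each HashSet operation runs against.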

Upvotes: 2
