Reputation: 3689
Given a List[Int]
in Scala, I wish to get the Set[Int]
of all Int
s which appear at least thresh
times. I can do this using groupBy
or foldLeft
, then filter
. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2)
.
Now suppose the List[Int]
is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Int
s frequencies, and I only care if they're at least thresh
. Once it passed thresh
there's no need to check anymore, just add the Int
to the Set[Int]
.
The question is: can I do this more efficiently for a very large List[Int]
,
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int]
might include some false-positives, or whether {the frequency of an Int
> thresh
} isn't really a Boolean
but a Double
in [0-1]
.
Upvotes: 2
Views: 2259
Reputation: 7162
You can change your foldLeft
example a bit using a mutable.Set
that is build incrementally and at the same time used as filter for iterating over your Seq
by using withFilter
. However, because I'm using withFilter
i cannot use foldLeft
and have to make do with foreach
and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq
the first time, because withFilter
filters the Seq
in the appended function (map, flatMap, foreach
) rather than returning a filtered Seq
.
EDIT:
I changed my solution to not use Seq.count
, which was stupid, as Aivean correctly pointed out.
Using Aiveans microbench I can see that it is still slightly slower than his approach, but still better than the authors first approach.
Authors solution
377
Ixx solution:
399
Ixx solution2 (eliminated excessive map writes):
110
Sascha Kolbergs solution:
72
Aivean solution:
54
Upvotes: 0
Reputation: 10882
First of all, you can't do better than O(N), as you need to check each element of your initial array at least once. You current approach is O(N), presuming that operations with IntMap
are effectively constant.
Now what you can try in order to increase efficiency:
Array
instead of IntMap
(index as the key). Another possible option will be mutable HashMap
with sufficient initail capacity. As my benchmark shows it actually makes significant differenceI don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Authors solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap
instead of immutable IntMap
and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.
Upvotes: 3
Reputation: 1749
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
Upvotes: 1
Reputation: 32269
You could use the foldleft
to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List
and prepend, which takes constant time (see comments).
Upvotes: 1