Reputation: 830
Like the example below, I'd like to accumulate values by key. I could use List, ArrayBuffer, Array, mutable.HashSet, etc. When the number of values per key varies widely and is not known in advance (e.g. 10k - 1M), which data structure is the most efficient?
In Java I would avoid List or Vector because of the cost of dynamic resizing. Performance-wise and/or memory-wise, what is best practice in Scala?
Thanks.
val res = data.flatMap { x =>
  if (some condition) {
    Some((x._2._2, ArrayBuffer[(Int, Double)]((x._1, x._2._1))))
  } else {
    None
  }
}
.reduceByKey { (x, y) => x ++ y }
UPDATE: The subsequent transforms, on Spark, are below. I'm creating a feature matrix (using sparse vectors) as data prep.
.map(x => (x._1, x._2.toArray.sortBy(_._1)))
.map(x => (yieldMap.value(x._1), x._2.map(_._1), x._2.map(_._2)))
Upvotes: 0
Views: 557
Reputation: 2337
You seem to be using Spark, so I assume you want to compute this on a cluster. When doing distributed computing, the most important questions are how you distribute the data and how much communication is needed between cluster nodes.
The fastest approach would probably be to map each key to a cluster node and then aggregate the results sequentially into a list. Looking at the API, you can achieve the mapping to cluster nodes with a Partitioner and the aggregation with aggregateByKey. aggregateByKey lets you specify a function that is applied in linear order over the data in one partition, so you can aggregate all values efficiently into a list. You also have to specify an associative aggregate function, but it does not matter how efficient it is, because with such a partitioning it will never be called.
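A minimal sketch of that idea, assuming an RDD of pairs like yours (the value types, the partition count, and the RDD name `pairs` are illustrative; the Spark call itself is shown in a comment so the folding logic can be demonstrated locally):

```scala
import scala.collection.mutable.ArrayBuffer

// Per-partition sequence op: append one value to the accumulator in place.
val seqOp: (ArrayBuffer[(Int, Double)], (Int, Double)) => ArrayBuffer[(Int, Double)] =
  (acc, v) => acc += v

// Combine op: merge two partial accumulators. With a partitioner that keeps
// each key on a single partition, this is never actually invoked.
val combOp: (ArrayBuffer[(Int, Double)], ArrayBuffer[(Int, Double)]) => ArrayBuffer[(Int, Double)] =
  (a, b) => a ++= b

// On Spark this would look like (hypothetical RDD[(String, (Int, Double))]):
//   pairs.partitionBy(new HashPartitioner(numPartitions))
//        .aggregateByKey(ArrayBuffer.empty[(Int, Double)])(seqOp, combOp)

// Local demonstration of the same linear-order folding over one partition's values:
val values = Seq((1, 0.5), (3, 1.5), (2, 2.5))
val folded = values.foldLeft(ArrayBuffer.empty[(Int, Double)])(seqOp)
```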
If you stick with what you have, without knowing or being able to assume anything about the order in which the reduce function is called, a plain Array might actually be the best data structure. Lists are fast if you only prepend elements, but you cannot ensure that here. Vectors, on the other hand, have effectively constant time for appending and prepending an element, but merging two vectors of similar size is linear anyway, and the constant factors involved with vectors are larger. If you have an efficiency problem with what you are doing now, I would really try aggregateByKey together with an optimal partitioning of your data.
Upvotes: 1
Reputation: 9698
Well, if you accumulate them for quick access, then of course you need something that provides O(1) lookup (such as HashMap). From your example I can see that you want to reduce by key in a later stage, which means you need to traverse it anyway.
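As a minimal sketch of accumulating values by key with a mutable HashMap (the data here is made up):

```scala
import scala.collection.mutable

val data = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0))

// O(1) lookup per key; getOrElseUpdate creates the buffer on first sight of a key.
val acc = mutable.HashMap.empty[String, mutable.ArrayBuffer[Double]]
for ((k, v) <- data)
  acc.getOrElseUpdate(k, mutable.ArrayBuffer.empty[Double]) += v
```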
List is OK if you only append to the head of the collection. In that case make a ListBuffer, fill it up incrementally, and then invoke .toList when you're done adding. That will save you some memory.
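A small sketch of that pattern (the values are illustrative):

```scala
import scala.collection.mutable.ListBuffer

val buf = ListBuffer.empty[(Int, Double)]
buf += ((1, 0.5))   // each append is O(1)
buf += ((2, 1.5))
buf += ((3, 2.5))
val result: List[(Int, Double)] = buf.toList  // cheap: no per-element copying
```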
If you don't append only to the head, take a Vector. Its operations are effectively constant time due to its tree representation (see here), and it is generally recommended over lists when performance is an issue.
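For example (illustrative values), both ends of a Vector can be extended cheaply:

```scala
val v = Vector(1, 2, 3)
val appended  = v :+ 4   // effectively constant-time append
val prepended = 0 +: v   // effectively constant-time prepend
```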
Here's a performance overview that might be of help.
Upvotes: 1