joshsuihn

Reputation: 830

Efficient data structure for aggregation in Scala

Like the example below, I'd like to accumulate values by key. I can use List, ArrayBuffer, Array, mutable.HashSet, etc. When the number of values per key is large, highly variable, and unknown in advance, i.e. wide (e.g. 10k - 1M), which data structure is most efficient?

In Java, I definitely avoid List or Vector because of their dynamic memory expansion. In Scala, what is best practice performance-wise and/or memory-wise?

Thanks.

val res = data.flatMap { x =>
          if (some condition) {
            Some((x._2._2, ArrayBuffer[(Int, Double)]((x._1, x._2._1))))
          } else {
            None
          }
        }
        .reduceByKey { (x, y) => x ++ y }

UPDATE: The subsequent transforms are as below on Spark. I'm creating feature matrix (using sparse vector) as data prep.

.map(x => (x._1, x._2.toArray.sortBy(_._1)))
.map { x => (yieldMap.value.get(x._1).get, x._2.map(_._1), x._2.map(_._2)) }

Upvotes: 0

Views: 557

Answers (2)

dth

Reputation: 2337

You seem to be using Spark, so I assume you want to compute this stuff on a cluster somehow? When doing distributed computing, the most important questions are how you distribute the data and how much communication is needed between cluster nodes.

The fastest approach would probably be to map each key to a cluster node and then aggregate the results sequentially into a list. Looking at the API, you can achieve the mapping to cluster nodes using a Partitioner and the aggregation using aggregateByKey. aggregateByKey lets you specify a function that is applied in linear order over the data in one partition, so you can aggregate all values efficiently into a list. You also have to specify an associative combine function, but it does not matter how efficient it is, because it will never be called.
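A minimal sketch of this approach, assuming a pair RDD named `data` of type `RDD[(String, (Int, Double))]` (the variable name, value types, and the partition count 100 are placeholders, not taken from the question):

```scala
import org.apache.spark.HashPartitioner
import scala.collection.mutable.ArrayBuffer

// Route all values of a key to one partition, then fold them sequentially.
val aggregated = data.aggregateByKey(
    ArrayBuffer.empty[(Int, Double)],       // zero value per key
    new HashPartitioner(100)                // each key lives on exactly one partition
  )(
    (buf, v) => buf += v,                   // seqOp: linear fold within a partition
    (b1, b2) => b1 ++= b2                   // combOp: required, but never invoked
                                            // when a key never spans two partitions
  )
```

This cannot be run without a Spark context, so treat it as an illustration of the aggregateByKey signature rather than drop-in code.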

If you stick with what you have, you cannot know or assume anything about the order in which the reduce function is called, so a plain Array might actually be the best data structure. Lists are fast if you prepend elements, but you cannot ensure that is what happens. Vectors, on the other hand, have effectively constant time for appending and prepending an element, but merging two vectors of similar size is linear anyway, and the constant factors involved with vectors are larger. If you have an efficiency problem with what you are doing now, I would really try aggregateByKey together with an optimal partitioning of your data.
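As a plain-Scala illustration (no Spark needed) of the merge that reduceByKey would perform when the values are Arrays, with made-up sample data:

```scala
// Two per-key chunks, as they might arrive from different partitions.
val a = Array((1, 1.0), (3, 3.0))
val b = Array((2, 2.0))

// Array ++ copies both sides into a fresh array: linear time, no ordering guarantee
// beyond "left operand first".
val merged = a ++ b
// merged.toList == List((1, 1.0), (3, 3.0), (2, 2.0))
```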

Upvotes: 1

slouc

Reputation: 9698

Well, if you accumulate them for quick access, then of course you need something that provides O(1) lookup (such as a HashMap). From your example I can see that you want to reduce by key at a later stage, which means you have to traverse it anyway.

List is OK if you only append to the head of the collection. In that case build a ListBuffer, fill it up incrementally, and invoke .toList when you're done adding. That will save you some memory.
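A small sketch of that pattern, with placeholder values:

```scala
import scala.collection.mutable.ListBuffer

// Fill the buffer incrementally; ListBuffer appends in O(1).
val buf = ListBuffer.empty[(Int, Double)]
buf += ((1, 0.5))
buf += ((2, 1.5))

// Convert once at the end; ListBuffer.toList is O(1) because it
// hands over its internal cons cells instead of copying.
val result: List[(Int, Double)] = buf.toList
```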

If you don't append only to the head, take a Vector. Its operations are effectively constant time due to its tree representation (see here), and it is generally recommended over lists when performance is an issue.
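For reference, both ends of a Vector are cheap to grow:

```scala
// Vector supports effectively constant-time append (:+) and prepend (+:).
val v = Vector(2, 3)
val appended  = v :+ 4   // Vector(2, 3, 4)
val prepended = 1 +: v   // Vector(1, 2, 3)
```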

Here's a performance overview that might be of help.

Upvotes: 1
