victorming888

Reputation: 121

How to create a co-occurrence matrix from a Spark RDD

I want to create a co-occurrence matrix from some tuples, see below:

import scala.collection.mutable.ArrayBuffer

val rdd = sc.parallelize(Array(
  Array("101","103","105"), Array("102","105"), Array("101","102","103","105")))
val coocMatrix = new ArrayBuffer[(String, Int)]()

// map: emit every within-tuple pair as ("a#b", 1)
rdd.collect().foreach { x =>
  for (i <- 0 to x.length - 1)
    for (j <- i + 1 to x.length - 1)
      coocMatrix += ((x(i) + "#" + x(j), 1))
}

// convert to rdd again
val rdd2 = sc.parallelize(coocMatrix)

// reduce: sum the counts per pair
val matrix = rdd2.reduceByKey(_ + _)

So we finally get the following data:

(101#103,2),(101#105,2),(103#105,2),(102#105,2),
(101#102,1),(102#103,1)

This algorithm is terribly slow: it collects everything to the driver and builds the pairs in O(n*n) time, which is infeasible when there are 2 million tuples. Is there a better algorithm to calculate this co-occurrence matrix?

Upvotes: 4

Views: 2838

Answers (1)

Christian Hirsch

Reputation: 2056

The `combinations` method allows you to extract the list of pairs occurring in a given array. After that, you can `reduceByKey` to sum the counts per pair:

    rdd.flatMap(_.combinations(2).map(pair => (pair.mkString("#"), 1)))
       .reduceByKey(_ + _)
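To sanity-check the pair generation without a Spark cluster, the same idea can be sketched with plain Scala collections (a local analogue of the answer, not Spark code; `groupBy` plays the role of `reduceByKey` here):

```scala
// Sample tuples from the question.
val data = Seq(
  Array("101", "103", "105"),
  Array("102", "105"),
  Array("101", "102", "103", "105")
)

// Emit every within-tuple pair as "a#b", then count occurrences.
val counts: Map[String, Int] = data
  .flatMap(_.combinations(2).map(_.mkString("#")))
  .groupBy(identity)
  .map { case (pair, occurrences) => pair -> occurrences.size }
```

On the sample data this yields counts such as `101#103 -> 2` and `101#102 -> 1`, matching the expected co-occurrence matrix above.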

Upvotes: 7
