Access tuple inside a tuple for anonymous map job in Spark

Question

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.

I have an RDD of (String, String) tuples. Since the tuples aren't unique, I want to see how many times each (String, String) combination occurs, so I use countByValue like so:

val PairCount = Pairs.countByValue().toSeq

which gives me output of type ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.

These Strings can be repeated in different combinations, and I essentially want to run word count on this PairCount variable, so I tried something like this to start:

PairCount.map(x => (x._1._1, x._2))

But the output this produces is String1->1, String2->1, String3->1, etc.

How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?

Update: @vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run

PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap

for each unique x I get x->1

For example, the above line of code generates mom->1, dad->1 even if the tuples out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.

How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
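A note on the toMap step described above: a Scala Map holds one value per key, so calling toMap on a sequence with repeated keys keeps only the last value seen for each key rather than every pair. To sum per key on a local collection, one option is to group by the key first. The sketch below is a minimal illustration in plain Scala (no Spark). The object name MarginalCounts, the second strings "a" and "b", and the val names are assumptions made up for the example; the counts are the ones from the question.

```scala
object MarginalCounts {
  // Hypothetical sample data shaped like PairCount: ((x, y), count).
  // The second strings "a" and "b" are invented for illustration.
  val pairCount: Seq[((String, String), Long)] = Seq(
    (("mom", "a"), 30L),
    (("dad", "a"), 59L),
    (("mom", "b"), 2L),
    (("dad", "b"), 14L)
  )

  // Keep only the first string of each pair together with its count.
  val flat: Seq[(String, Long)] = pairCount.map { case ((x, _), n) => x -> n }

  // toMap keeps a single value per key: later pairs overwrite earlier ones.
  val lastWins: Map[String, Long] = flat.toMap

  // Group by key, then sum the counts inside each group.
  val summed: Map[String, Long] =
    flat.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(lastWins)
    println(summed)
  }
}
```

With this sample data, lastWins equals Map(mom -> 2, dad -> 14) and summed equals Map(mom -> 32, dad -> 73). On an RDD of pairs, reduceByKey(_ + _) performs the same summation without collecting to the driver first.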

Semicolons and Duct Tape · Accepted Answer

To get the histograms for the (String, String) RDD, I used this code:

val Hist_X  = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y  = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)

where histogram was the (String,String) RDD
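A possible variation on the code above, sketched under assumptions: since the joint histogram already holds one count per (x, y) pair, the marginals can be derived from it by re-keying on one component and summing, rather than mapping over the original RDD three times. The object name HistogramSketch and its helper functions are hypothetical, counts are kept as Long instead of the 1.0 Double used in the answer, and the code assumes spark-core is on the classpath.

```scala
import org.apache.spark.rdd.RDD

object HistogramSketch {
  // Joint histogram: one count per distinct (x, y) pair, kept distributed.
  def joint(pairs: RDD[(String, String)]): RDD[((String, String), Long)] =
    pairs.map(p => (p, 1L)).reduceByKey(_ + _)

  // Marginal over the first string: re-key on x, then sum the joint counts.
  def marginalX(jointCounts: RDD[((String, String), Long)]): Map[String, Long] =
    jointCounts
      .map { case ((x, _), n) => (x, n) }
      .reduceByKey(_ + _)
      .collectAsMap()
      .toMap

  // Marginal over the second string: re-key on y, then sum the joint counts.
  def marginalY(jointCounts: RDD[((String, String), Long)]): Map[String, Long] =
    jointCounts
      .map { case ((_, y), n) => (y, n) }
      .reduceByKey(_ + _)
      .collectAsMap()
      .toMap
}
```

Given the histogram RDD from the answer, HistogramSketch.joint(histogram) yields the joint counts, and marginalX and marginalY collect the per-string totals to the driver, so they are only appropriate when the number of distinct strings fits in driver memory.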
