Reputation: 127
I have two data sets of the form (film name, actress's name) and (film name, director's name).
I want to join them on the name of the film, so the result is (film name, actress's name, director's name).
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object spark {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FindFrequentPairs").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text1: RDD[String] = sc.textFile(args(0))
    val text2: RDD[String] = sc.textFile(args(1))
    val joined = text1.join(text2)
I tried to use join, but it says 'cannot resolve symbol join.' Do you have any idea how to join them?
This is part of my data sets, (film name, actress):
('"Please Like Me" (2013) {Rhubarb and Custard (#1.1)}', '$haniqua')
('"Please Like Me" (2013) {Spanish Eggs (#1.5)}', '$haniqua')
('A Woman of Distinction (1950) (uncredited)', '& Ashour, Lucienne')
('Around the World (1943) (uncredited)', '& Ashour, Lucienne')
('Chain Lightning (1950) (uncredited)', '& Ashour, Lucienne')
Upvotes: 1
Views: 854
Reputation: 2089
You have to create pair RDDs from your data sets first, and then apply the join transformation. join is only defined on RDDs of key/value pairs (through PairRDDFunctions), which is why it does not resolve on an RDD[String]. Your data sets also do not look cleanly formatted.
Please consider the example below.
**Dataset1**
a 1
b 2
c 3
**Dataset2**
a 8
b 4
In Scala, your code should look like this:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt")
  .map { line => val fields = line.split(" "); (fields(0), fields(1)) }
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt")
  .map { line => val fields = line.split(" "); (fields(0), fields(1)) }
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))
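Applied back to the film data in the question, the joined value is a nested tuple of the form (film, (actress, director)), so one more map gives the flat triple you asked for. Here is a minimal sketch, assuming both files have already been parsed into pair RDDs keyed by the film string; the names actresses, directors, and the director value are placeholders for illustration, not from your real files:

// Hypothetical pair RDDs, standing in for your parsed data sets:
// (filmName, actressName) and (filmName, directorName).
val actresses = sc.parallelize(Seq(
  ("\"Please Like Me\" (2013) {Rhubarb and Custard (#1.1)}", "$haniqua")
))
val directors = sc.parallelize(Seq(
  ("\"Please Like Me\" (2013) {Rhubarb and Custard (#1.1)}", "Some Director") // placeholder name
))

// join keeps only films present in both RDDs and yields (film, (actress, director));
// the map then flattens that into the (film, actress, director) triple.
val triples = actresses.join(directors).map { case (film, (actress, director)) =>
  (film, actress, director)
}
triples.collect

Note that join is an inner join: films that appear in only one of the two RDDs are dropped. If you need to keep them, leftOuterJoin or fullOuterJoin work the same way but wrap the missing side in an Option.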
Upvotes: 2