tobby

Reputation: 127

Join two datasets by using the first column in scala spark

I have two data sets of the form (film name, actress's name) and (film name, director's name).

I want to join them on the film name, so the result is (film name, actress's name, director's name).

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.io.Source

object spark {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("FindFrequentPairs").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val text1: RDD[String] = sc.textFile(args(0))
    val text2: RDD[String] = sc.textFile(args(1))

    val joined = text1.join(text2) // fails: text1 is an RDD[String], not a pair RDD
  }
}

I tried to use 'join', but I get 'cannot resolve symbol join'. Do you have any idea how to join them?

This is part of my data set of (film name, actress) pairs:

('"Please Like Me" (2013) {Rhubarb and Custard (#1.1)}', '$haniqua')
('"Please Like Me" (2013) {Spanish Eggs (#1.5)}', '$haniqua')
('A Woman of Distinction (1950)  (uncredited)', '& Ashour, Lucienne')
('Around the World (1943)  (uncredited)', '& Ashour, Lucienne')
('Chain Lightning (1950)  (uncredited)', '& Ashour, Lucienne')

Upvotes: 1

Views: 854

Answers (1)

Arunakiran Nulu

Reputation: 2089

You have to create pair RDDs from your data sets first, and then apply the join transformation. Your data sets also do not look like clean key/value text, so you will need to parse them into (key, value) pairs.

Consider the example below.

**Dataset1**

a 1
b 2
c 3

**Dataset2**

a 8
b 4

In Scala, your code should look like this:

// Build pair RDDs keyed by the first column of each file
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map { line =>
  val cols = line.split(" ")
  (cols(0), cols(1))
}

val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map { line =>
  val cols = line.split(" ")
  (cols(0), cols(1))
}

// Join on the key; the result is an RDD[(String, (String, String))]
val joinRDD = pairRDD1.join(pairRDD2)

joinRDD.collect

Here is the result from the Scala shell. Note that join is an inner join, so the key c, which appears only in the first dataset, is dropped:

res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))
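Applied to the data in the question, where each line looks like a Python-style tuple such as ('"Please Like Me" (2013) {Rhubarb and Custard (#1.1)}', '$haniqua'), a minimal sketch might look like the following. The parsing logic is an assumption based only on the sample lines shown in the question (fields wrapped in ('...') and separated by ', '); adjust it if the real files differ. Here text1 and text2 are the RDD[String] values from the question's code.

// Hypothetical parser for lines of the form ('<film>', '<name>').
// Assumes the fields are wrapped in ('...') and separated by "', '".
def parsePair(line: String): (String, String) = {
  val inner = line.stripPrefix("('").stripSuffix("')")
  val Array(film, name) = inner.split("', '", 2) // limit 2 keeps any later separators inside the name
  (film, name)
}

val actresses = text1.map(parsePair) // (film name, actress's name)
val directors = text2.map(parsePair) // (film name, director's name)

// Inner join on the film name: RDD[(String, (String, String))]
val joined = actresses.join(directors)

// Flatten to (film name, actress's name, director's name)
val result = joined.map { case (film, (actress, director)) => (film, actress, director) }

If some films appear in only one of the two files and you want to keep them, use leftOuterJoin or fullOuterJoin instead of join.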

Upvotes: 2
