DebD

Reputation: 386

Join files in Apache Spark

I have a file like this, code_count.csv:

code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004

Another file like this, details.csv:

code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu

I want the total sum of count for each code, but in the final output I want the exp_code, like this:

Aerogon international,5
Bloomberg Xtern,9
Classic Divide,4

Here is my code

var countData = sc.textFile("C:/path/to/code_count.csv")
// skip the header row, then key each record by code with its count
var countDataKV = countData.filter(!_.startsWith("code")).map(_.split(",")).map(x => (x(0), x(1).toInt))
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)

gives

Array[(String, Int)] = Array((AE,5), (BX,9))

Here sum is an RDD[(String, Int)]. I am kind of confused about how to pull the exp_code from the other file. Please guide.

Upvotes: 0

Views: 66

Answers (2)

koiralo

Reputation: 23109

You need to calculate the sum after grouping by code and then join with the other DataFrame. Below is a similar example.

import spark.implicits._
import org.apache.spark.sql.functions._

val df1 = spark.sparkContext.parallelize(Seq(("AE",2,2008), ("AE",3,2008), ("BX",1,2005),
    ("CD",4,2004), ("HU",1,2003), ("BX",8,2004)))
  .toDF("code","count","year")

val df2 = spark.sparkContext.parallelize(Seq(("AE","Aerogon international"),
    ("BX","Bloomberg Xtern"), ("CD","Classic Divide"), ("HU","Honololu")))
  .toDF("code","exp_code")

// total count per code, then join back to get exp_code and drop the key
val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))

val finalDF = sumdf1.join(df2, "code").drop("code")

finalDF.show()
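
Since sum in the question is already an RDD[(String, Int)], the same result can also be reached without DataFrames. A minimal sketch, assuming details.csv sits at an analogous path and is keyed the same way:

// key the details file by code: RDD[(String, String)]
val details = sc.textFile("C:/path/to/details.csv")
val detailsKV = details.filter(!_.startsWith("code")).map(_.split(",")).map(x => (x(0), x(1)))

// join on code, then keep only (exp_code, total)
val result = sum.join(detailsKV).map { case (_, (total, expCode)) => (expCode, total) }
result.collect().foreach(println)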

Upvotes: 1

Aravind Kumar Anugula

Reputation: 1326

If you are using Spark version 2.0 or later, you can use the following code directly; CSV support (formerly com.databricks.spark.csv) is built into Spark 2.0.

val codeDF = spark
      .read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs://pathTo/code_count.csv")    

val detailsDF = spark
      .read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs://pathTo/details.csv")    
import org.apache.spark.sql.functions._

val resDF = codeDF.join(detailsDF, codeDF.col("code") === detailsDF.col("code"))
  .groupBy(codeDF.col("code"), detailsDF.col("exp_code"))
  .agg(sum("count").alias("cnt"))
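
resDF still carries the join key; to match the output format in the question, a final projection along these lines should do:

// keep only the expanded code and the total
resDF.select("exp_code", "cnt").show()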

If you are using Spark version <= 1.6, you can use the following code.

You can follow this link to use com.databricks.spark.csv:

https://github.com/databricks/spark-csv
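
For reference, a sketch of how the package is typically pulled into a build with sbt (the version shown is illustrative; check the repository for the current one). The equivalent for an interactive session is the --packages flag on spark-shell.

// build.sbt: add the spark-csv connector (illustrative version)
libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"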

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

val codeDF = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load("hdfs://pathTo/code_count.csv")

val detailsDF = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load("hdfs://pathTo/details.csv")

import org.apache.spark.sql.functions._
val resDF = codeDF.join(detailsDF, codeDF.col("code") === detailsDF.col("code"))
  .groupBy(codeDF.col("code"), detailsDF.col("exp_code"))
  .agg(sum("count").alias("cnt"))

Upvotes: 1
