Reputation: 68
I have two dataframes in Scala; I created them with SQL queries via HiveContext (please see the dataframes as images here).
Please ignore the repeated headers in the second dataframe. I want to compare the Skill column present in both dataframes and pull the matching Role, Skill2 and Emerging columns into df1, i.e. demand_df.
I tried this in pandas and was able to achieve it with the following code segment:

import pandas as pd

df1 = pd.DataFrame([
    ["INDIA", "XXX", "developer", "UNKNOWN", 121],
    ["INDIA", "XXXX", "software engineer", "UNKNOWN", 121],
    ["POLAND", "XX", "english", "KNOWN", 122]],
    columns=['country', 'level', 'Skill', 'r2d2', 'tax'])
df2 = pd.DataFrame([
    ["english", "NaN", "teacher", "NaN", "NaN"],
    [20000, "Unknown", "NaN", "NaN", "NaN"],
    ["microsoft", "Known", "Software Engineer", "Microsoft", "Enterprise"]],
    columns=['Skill', 'R2D2', 'Role', 'Skill2', 'Emerging'])
# Note: the column is 'Emerging' (capital E) in df2; lowercase 'emerging' raises a KeyError.
result = df1.merge(df2[['Skill', 'Role', 'Skill2', 'Emerging']], how='left', on='Skill')
Please guide me, as I am new to Scala.
Upvotes: 0
Views: 665
Reputation: 56
Since you have created two dataframes and want to join them on Skill to build a new dataframe containing df1's columns plus Role, Skill2 and Emerging from df2, you can do it through SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Register both dataframes as temporary tables with:
df1.registerTempTable("df1")
df2.registerTempTable("df2")
After that, use a simple Hive query to join and fetch the required columns from the dataframes:
val df3 = sqlContext.sql("select a.*, b.Role, b.Skill2, b.Emerging from df1 a left join df2 b on (a.Skill = b.Skill)")
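Equivalently, the same left join can be expressed with the DataFrame API, without registering temp tables. This is a sketch assuming df1 and df2 are the Spark DataFrames described in the question (it needs a running SparkContext/SQLContext, and the three-argument join overload from Spark 1.6+):

```scala
// Sketch: the same left join via the DataFrame API (no temp tables needed).
// Assumes df1 and df2 are the Spark DataFrames from the question.
val df3 = df1.join(
  df2.select("Skill", "Role", "Skill2", "Emerging"), // keep only the wanted columns from df2
  Seq("Skill"),  // join key; passing a Seq avoids a duplicated Skill column in the output
  "left")        // left outer join, mirroring the pandas how='left'
df3.show()
```

Using `Seq("Skill")` instead of an explicit column equality (`df1("Skill") === df2("Skill")`) keeps a single Skill column in the result, which matches what the pandas merge produces.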
Upvotes: 1