Reputation: 68
I have two dataframes in Scala; I created them with SQL queries via HiveContext (please see the dataframes as images here).
Please ignore the repeated headers in the second dataframe. I want to compare the Skill column present in both dataframes and pull the matching Role, Skill2 and Emerging columns into df1, i.e. demand_df.
I tried this in pandas and was able to achieve it with the following code segment:

import pandas as pd

df1 = pd.DataFrame([
    ["INDIA", "XXX", "developer", "UNKNOWN", 121],
    ["INDIA", "XXXX", "software engineer", "UNKNOWN", 121],
    ["POLAND", "XX", "english", "KNOWN", 122]],
    columns=['country', 'level', 'Skill', 'r2d2', 'tax'])
df2 = pd.DataFrame([
    ["english", "NaN", "teacher", "NaN", "NaN"],
    [20000, "Unknown", "NaN", "NaN", "NaN"],
    ["microsoft", "Known", "Software Engineer", "Microsoft", "Enterprise"]],
    columns=['Skill', 'R2D2', 'Role', 'Skill2', 'Emerging'])
# Note: the column is 'Emerging' (capital E) in df2; lowercase 'emerging' raises a KeyError.
result = df1.merge(df2[['Skill', 'Role', 'Skill2', 'Emerging']], how='left', on='Skill')
Please guide me, as I am new to Scala.
Upvotes: 0
Views: 665
Reputation: 56
Since you have created two dataframes and want to join them on Skill to build a new dataframe containing df1's columns plus Role, Skill2 and Emerging from df2, you can do it through SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Register both dataframes as temporary tables with:
df1.registerTempTable("df1")
df2.registerTempTable("df2")
After that, use a simple Hive query to join and fetch the required columns from the dataframes:
val df3 = sqlContext.sql("select a.*, b.Role, b.Skill2, b.Emerging from df1 a left join df2 b on (a.Skill = b.Skill)")
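Equivalently, the same left join can be expressed with the DataFrame API, without registering temp tables. This is a sketch assuming df1 and df2 are the Spark DataFrames described in the question (it needs a running SparkContext/SQLContext, and the three-argument join overload from Spark 1.6+):

```scala
// Sketch: the same left join via the DataFrame API (no temp tables needed).
// Assumes df1 and df2 are the Spark DataFrames from the question.
val df3 = df1.join(
  df2.select("Skill", "Role", "Skill2", "Emerging"), // keep only the wanted columns from df2
  Seq("Skill"),  // join key; passing a Seq avoids a duplicated Skill column in the output
  "left")        // left outer join, mirroring the pandas how='left'
df3.show()
```

Using `Seq("Skill")` instead of an explicit column equality (`df1("Skill") === df2("Skill")`) keeps a single Skill column in the result, which matches what the pandas merge produces.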
Upvotes: 1