Reputation: 67
The first RDD
, user_person
, is a Hive table which records every person's information:
+---------+---+----+
|person_id|age| bmi|
+---------+---+----+
| -100| 1|null|
| 3| 4|null|
...
Below is my second RDD
, a Hive table that only has 40 row and only includes basic information:
| id|startage|endage|energy|
| 1| 0| 0.2| 1|
| 1| 2| 10| 3|
| 1| 10| 20| 5|
I want to compute every person's energy requirement by age scope for each row.
For example,a person's age is 4, so it require 3 energy. I want to add that info into RDD user_person
.
How can I do this?
Upvotes: 0
Views: 78
Reputation: 1912
First, initialize the spark session with enableHiveSupport()
and copy Hive config files (hive-site.xml, core-site.xml, and hdfs-site.xml) to Spark/conf/ directory, to enable Spark to read from Hive.
val sparkSession = SparkSession.builder()
.appName("spark-scala-read-and-write-from-hive")
.config("hive.metastore.warehouse.dir", params.hiveHost + "user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
Read the Hive tables as Dataframes as below:
val personDF= spark.sql("SELECT * from user_person")
val infoDF = spark.sql("SELECT * from person_info")
Join these two dataframes using below expression:
val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")
The outputDF
dataframe contains all the columns of input dataframes.
Upvotes: 2