Reputation: 11
I am new to Spark 1.6. I'd like to read a Parquet file and process it. To simplify, suppose I have a Parquet file with this structure:
id, amount, label
and I have 3 rules:
amount < 10000 => label = LOW
10000 < amount < 100000 => label = MEDIUM
amount > 1000000 => label = HIGH
How can I do this in Spark with Scala?
I tried something like this:
case class SampleModels(
  id: String,
  amount: Double,
  label: String
)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map( row =>
  MY LOGIC
  WRITE OUTPUT IN PARQUET
)
Is this the right approach? Is it efficient? "MY LOGIC" could be more complex.
Thanks
Upvotes: 1
Views: 7703
Reputation: 1992
Yes, it is the correct approach. It will make a single pass over your complete data to build the extra column you need.
If you want a SQL way, this is the way to go:
val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = sqlContext.sql("select *, case when amount < 10000 then 'LOW' else 'HIGH' end as label from MY_TABLE")
Remember to use HiveContext instead of SQLContext, though.
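For all three rules from the question, the CASE expression can be extended along these lines. This is only a sketch: the output path is hypothetical, and the cutoffs at 10000 and 100000 are assumptions, since the question's ranges leave the exact boundaries open.

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")

// Assumed boundaries: < 10000 => LOW, < 100000 => MEDIUM, otherwise HIGH
val labelled = sqlContext.sql(
  """select id, amount,
    |       case when amount < 10000  then 'LOW'
    |            when amount < 100000 then 'MEDIUM'
    |            else 'HIGH'
    |       end as label
    |  from MY_TABLE""".stripMargin)

labelled.write.parquet("/path/output/")   // hypothetical output path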
Upvotes: 0
Reputation: 13936
Yes, it's the right way to work with Spark. If your logic is simple, you can try to use the built-in functions to operate on the DataFrame directly (like when in your case). It will be a little faster than mapping rows to a case class and executing your code in the JVM, and you will be able to save the results back to Parquet easily.
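A short sketch of that idea with the built-in column functions (the cutoffs at 10000 and 100000 and the output path are assumptions mirroring the question's rules):

import org.apache.spark.sql.functions.when

// Derive the label column directly on the DataFrame; no case class needed
val df = sqlContext.read.parquet("/path/file/")
val labelled = df.withColumn("label",
  when(df("amount") < 10000, "LOW")
    .when(df("amount") < 100000, "MEDIUM")
    .otherwise("HIGH"))

labelled.write.parquet("/path/output/")   // hypothetical output path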
Upvotes: 1