Reputation: 11
I am new to Spark 1.6. I'd like to read a Parquet file and process it. To simplify, suppose I have a Parquet file with this structure:
id, amount, label
and I have 3 rules:
amount < 10000 => label = LOW
10000 < amount < 100000 => label = MEDIUM
amount > 1000000 => label = HIGH
How can I do this in Spark with Scala?
I tried something like this:
case class SampleModels(
  id: String,
  amount: Double,
  label: String
)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map( row =>
  MY LOGIC
  WRITE OUTPUT IN PARQUET
)
Is this the right approach? Is it efficient? "MY LOGIC" could be more complex.
Thanks
Upvotes: 1
Views: 7703
Reputation: 1992
Yes, it is the correct approach. It will make a single pass over your complete data to build the extra column you need.
If you want a SQL way, this is the way to go:
val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")
val df2 = sqlContext.sql("select *, case when amount < 10000 then 'LOW' else 'HIGH' end as label from MY_TABLE")
Remember to use HiveContext instead of SQLContext, though.
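For all three rules from the question, the CASE expression can be extended along these lines. This is only a sketch: the output path is hypothetical, and the cutoffs at 10000 and 100000 are assumptions, since the question's ranges leave the exact boundaries open.

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("/path/file/")
df.registerTempTable("MY_TABLE")

// Assumed boundaries: < 10000 => LOW, < 100000 => MEDIUM, otherwise HIGH
val labelled = sqlContext.sql(
  """select id, amount,
    |       case when amount < 10000  then 'LOW'
    |            when amount < 100000 then 'MEDIUM'
    |            else 'HIGH'
    |       end as label
    |  from MY_TABLE""".stripMargin)

labelled.write.parquet("/path/output/")   // hypothetical output path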
Upvotes: 0
Reputation: 13936
Yes, it's the right way to work with Spark. If your logic is simple, you can try to use the built-in functions to operate on the DataFrame directly (like when in your case). It will be a little faster than mapping rows to a case class and executing your code in the JVM, and you will be able to save the results back to Parquet easily.
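A short sketch of that idea with the built-in column functions (the cutoffs at 10000 and 100000 and the output path are assumptions mirroring the question's rules):

import org.apache.spark.sql.functions.when

// Derive the label column directly on the DataFrame; no case class needed
val df = sqlContext.read.parquet("/path/file/")
val labelled = df.withColumn("label",
  when(df("amount") < 10000, "LOW")
    .when(df("amount") < 100000, "MEDIUM")
    .otherwise("HIGH"))

labelled.write.parquet("/path/output/")   // hypothetical output path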
Upvotes: 1