How to write Huge Data ( almost 800 GB) as a hive orc table in HDFS using SPARK?

Question

I am working in Spark Project since last 3-4 months and recently.

I am doing some calculation with a huge history file (800 GB) and a small incremental file (3 GB).

The calculation is happening very fast in spark using hqlContext & dataframe, but when I am trying to write the calculated result as a hive table with orc format which will contain almost 20 billion of records with a data size of almost 800 GB is taking too much time (more than 2 hours and finally getting failed).

My cluster details are: 19 nodes , 1.41 TB of Total Memory, Total VCores are 361.

For tuneup I am using

--num-executors 67
--executor-cores 6
--executor-memory 60g
--driver-memory 50g
--driver-cores 6
--master yarn-cluster
--total-executor-cores 100
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC"

at run time.

If I take a count of result, then it is completing within 15 minutes, but if I want to write that result in HDFS as hive table.

[ UPDATED_RECORDS.write.format("orc").saveAsTable("HIST_ORC_TARGET") ]

then I am facing the above issue.

Please provide me with a suggestion or anything regarding this as I am stuck in this case since last couple of days.

Code format:

val BASE_RDD_HIST = hqlContext.sql("select * from hist_orc")
val BASE_RDD_INCR = hqlContext.sql("select * from incr_orc")

some spark calculation using dataframe, hive query & udf.....

Finally:

result.write.format("orc").saveAsTable("HIST_ORC_TARGET_TABLE")

How to write Huge Data ( almost 800 GB) as a hive orc table in HDFS using SPARK?

Answers (1)

Related Questions