Reputation: 351
I would like to analyze a large dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Spark), but it is very slow. Here is what I do:
The data is downloaded from here.
val data = spark.read.json(path)
-- this crashes. The data is stored in HDFS.
val rdd = sc.textFile(path)
rdd.count()
-- this also crashes.
rdd.take(10)
-- this works.
It was not possible to unzip the file, so I read the compressed data.json.gz directly.
Any suggestions? How can I read it with the JSON reader?
Thanks
Upvotes: 1
Views: 1120
Reputation: 307
You mention the size after unzipping but also say "It was not possible to unzip the file". If you are reading a compressed file from HDFS, the whole thing will be pulled into memory because a gzip file cannot be split. This could be what is leading to the OOMEs (out-of-memory errors).
What do you mean exactly by "it crashes"? What exception is being thrown?
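If the single gzip stream is indeed the bottleneck, one common workaround is to read the file once as plain text, repartition it, and write it back in a splittable form so the JSON reader can work in parallel. A minimal sketch, assuming the Spark shell (where spark and sc are predefined) and hypothetical HDFS paths:

// Read the gzip file; this stays a single task because gzip is not splittable.
val raw = sc.textFile("hdfs:///data/data.json.gz")

// Spread the lines across the cluster and write them back uncompressed,
// so that later reads can be split into many tasks.
// 200 is an arbitrary partition count chosen for illustration.
raw.repartition(200).saveAsTextFile("hdfs:///data/data_jsonlines")

// Now the JSON reader can parallelize over the many output files.
val df = spark.read.json("hdfs:///data/data_jsonlines")
df.count()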
Upvotes: 1
Reputation: 145
You can try loading every field as a string by providing a manual schema. With an explicit schema, Spark does not need an extra pass over the data to infer the types, which should ease the processing.
from pyspark.sql import types as t

# Declare every column as StringType so no type inference is needed.
schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.StringType(), True),
    ...
])

df = spark.read \
    .json('path-to-json', schema=schema)
Upvotes: 1