Rebecca

Reputation: 351

Reading/analysing a JSON file of about 1 TB in Spark

I would like to analyse a large dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Spark), but it's very slow. Here is what I do:

  1. The data is downloaded from here.

  2. val data = spark.read.json(path) crashes. The data is stored in HDFS.

  3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.

  4. rdd.take(10) and similar small actions are OK.

  5. It was not possible to unzip the file, so I read the compressed data.json.gz directly.

Any suggestions? How can I read it with the JSON reader?

Thanks

Upvotes: 1

Views: 1120

Answers (2)

Fenris

Reputation: 307

You mention the size after unzipping but also say "It was not possible to unzip the file". If you are reading a compressed (gzip) file from HDFS, it cannot be split, so the whole thing will be pulled into memory. This could be the cause of the OOMEs.

What do you mean exactly by "it crashes"? What exception is being thrown?
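
If the gzip file is indeed the bottleneck, one workaround is to rewrite it once into a splittable form before doing any real analysis. A minimal PySpark sketch (the HDFS paths and the partition count are made up, and it assumes one JSON object per line):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-gz-json").getOrCreate()

# A .gz file is not splittable, so this read is handled by a single task.
# Reading it as plain text avoids JSON schema inference on the first pass.
raw = spark.read.text("hdfs:///data/data.json.gz")

# Rewrite the lines uncompressed across many files so that later jobs
# can process them in parallel.
raw.repartition(200).write.mode("overwrite").text("hdfs:///data/data_split")

# The JSON reader can now parallelise over the split copy.
df = spark.read.json("hdfs:///data/data_split")
df.printSchema()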

Upvotes: 1

roizaig

Reputation: 145

You can try loading everything as strings by providing a manual schema; this lets Spark skip schema inference and should ease the processing.

from pyspark.sql import types as t

# Declaring every field as StringType avoids the inference pass over the data.
schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.StringType(), True),
    ...
])

df = spark.read.json('path-to-json', schema=schema)
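
Once it is loaded, the columns that actually need a proper type can be cast afterwards. A short follow-up sketch (the column names are just the ones from the example schema):

from pyspark.sql import functions as F

# Cast only the columns that need a real type; everything else stays a string.
typed = df.withColumn("Age", F.col("Age").cast("int"))
typed.select("Name", "Age").show(5)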

Upvotes: 1
