Reputation: 351
I would like to analyze a large dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Spark), but it is very slow. Here is what I do:
The data is downloaded from here.
val data = spark.read.json(path)
-- this crashes. The data is stored in HDFS.
val rdd = sc.textFile(path)
rdd.count()
-- this also crashes.
rdd.take(10)
-- this works.
It was not possible to unzip the file, so I read the compressed data.json.gz directly.
Any suggestions? How can I read it with the JSON reader?
Thanks
Upvotes: 1
Views: 1120
Reputation: 307
You mention the size after unzipping but also say "It was not possible to unzip the file". If you are reading a compressed file from HDFS, the whole thing will be pulled into memory because a gzip file cannot be split. This could be what is leading to the OOMEs (out-of-memory errors).
What do you mean exactly by "it crashes"? What exception is being thrown?
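If the single gzip stream is indeed the bottleneck, one common workaround is to read the file once as plain text, repartition it, and write it back in a splittable form so the JSON reader can work in parallel. A minimal sketch, assuming the Spark shell (where spark and sc are predefined) and hypothetical HDFS paths:

// Read the gzip file; this stays a single task because gzip is not splittable.
val raw = sc.textFile("hdfs:///data/data.json.gz")

// Spread the lines across the cluster and write them back uncompressed,
// so that later reads can be split into many tasks.
// 200 is an arbitrary partition count chosen for illustration.
raw.repartition(200).saveAsTextFile("hdfs:///data/data_jsonlines")

// Now the JSON reader can parallelize over the many output files.
val df = spark.read.json("hdfs:///data/data_jsonlines")
df.count()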
Upvotes: 1
Reputation: 145
You can try loading every field as a string by providing a manual schema. With an explicit schema, Spark does not need an extra pass over the data to infer the types, which should ease the processing.
from pyspark.sql import types as t

# Declare every column as StringType so no type inference is needed.
schema = t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Age", t.StringType(), True),
    ...
])

df = spark.read \
    .json('path-to-json', schema=schema)
Upvotes: 1