Evgeniy Vasilev

Reputation: 110

Processing json much slower than csv with multiple cores

I have a JSON file and a CSV file with identical content: 1.8M Amazon reviews.

I am performing two operations: a DataFrame row count and computing TF-IDF of the text data. I tried these operations with 1, 2, 4, and 8 cores. As the number of cores increases, the processing speed of the CSV-based DataFrame scales proportionally, but the JSON-based one stays the same.

Row count example:

data = spark.read.format("csv").option("header", "true").load("path/amazon_half.csv")
%timeit -n 10 data.count()
# "header" is a CSV-specific option and has no effect on the JSON reader
djs = spark.read.format("json").load("path/amazon_half.json")
%timeit -n 10 djs.count()

The attached table shows the time in seconds to perform these operations with different numbers of cores.

[image: table of timings in seconds for 1, 2, 4, and 8 cores]

I would expect the time required to process JSON and CSV containing the same data to be roughly equal. Is this normal, and if so, is there a way to process JSON as fast as CSV in Spark?

Upvotes: 2

Views: 1382

Answers (1)

Thiago Baldim

Reputation: 7742

No, it will never be the same speed.

First, JSON is one of the worst formats for big data. As the joke goes: if the data is big, why not use JSON to make it bigger.

Spark builds a columnar abstraction over the data, and reading CSV into it is much faster, because the file is simpler to parse and smaller.

A CSV of the data looks like this:

key, value
a, 2
b, 3
c, 44

JSON looks like this:

{key: "a", value: 2}, 
{key: "b", value: 3}, 
{key: "c", value: 44}
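The size difference is easy to verify with a quick sketch (plain Python rather than Spark; the three toy records mirror the example above, not the question's dataset):

```python
import csv
import io
import json

rows = [{"key": "a", "value": 2}, {"key": "b", "value": 3}, {"key": "c", "value": 44}]

# CSV: the field names appear exactly once, in the header line
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["key", "value"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = len(buf.getvalue().encode("utf-8"))

# JSON lines: every single record repeats the field names
json_bytes = sum(len(json.dumps(r).encode("utf-8")) + 1 for r in rows)

print(csv_bytes, json_bytes)  # JSON is larger, and the gap grows with row count
```

With millions of rows, the repeated field names become a significant share of the bytes Spark has to read.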

So you can see JSON carries more data: every record repeats the field names, so there are simply more bytes to read and parse in each partition, and that extra volume adds up across 1.8M rows. Parsing is also heavier per record: each JSON line must first be built into a Map object and only then converted into a Dataset row, whereas a CSV line splits directly into column values.
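The per-record parsing cost can also be sketched without Spark (stdlib only; the generated rows and timings are illustrative, not from the question's dataset):

```python
import csv
import io
import json
import timeit

n = 10_000
csv_lines = "\n".join(f"k{i},{i}" for i in range(n))
json_lines = "\n".join(json.dumps({"key": f"k{i}", "value": i}) for i in range(n))

def parse_csv():
    # CSV: each line splits directly into a list of column values
    return [row for row in csv.reader(io.StringIO(csv_lines))]

def parse_json():
    # JSON lines: each line is first parsed into a dict (a map of
    # field name -> value), which would still need a dict -> row step
    return [json.loads(line) for line in json_lines.splitlines()]

t_csv = timeit.timeit(parse_csv, number=5)
t_json = timeit.timeit(parse_json, number=5)
print(f"csv: {t_csv:.3f}s  json: {t_json:.3f}s")
```

On top of this, Spark's JSON reader makes an extra pass over the whole file to infer the schema unless you supply one explicitly.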

Those are the reasons.

Upvotes: 2
