Vishnu Prathish

Reputation: 369

Very slow parquet reads

I'm trying to read parquet files from two different locations, A and B. Both are parquet files in GCS with approximately the same number of columns in the schema (80 - 90, mostly string). B is incredibly small in file size and record count (about 5 orders of magnitude smaller than A), but it takes approximately the same time to read from GCS as A does. I'm wondering why that is.

scala> show_timing{spark.read.parquet("gs://bucket-name/tables/A/year=2018/month=4/day=5/*")}
Time elapsed: 34862525 microsecs
res5: org.apache.spark.sql.DataFrame = [a1: string, a2: string ... 84 more fields]

scala> show_timing{spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*")}
Time elapsed: 25094417 microsecs
res6: org.apache.spark.sql.DataFrame = [b1: string, b2: string ... 81 more fields]

scala> res5.count()
res7: Long = 2404444736

scala> res6.count()
res8: Long = 98787

My Spark version is 2.2. I understand this isn't much information to go on, but I'm not sure what else to investigate.

Upvotes: 2

Views: 6111

Answers (1)

Roberto Congiu

Reputation: 5223

The reason is that Spark does not actually read the data when it executes the read.parquet operation, which is why the two reads take roughly the same time. read is lazy: the data is accessed only when you execute an action (like count). I bet the two count operations don't take the same time!

When read is executed, Spark only reads the parquet metadata to figure out the schema, so the file size doesn't matter much at that point.

Have a look at transformations vs. actions in Spark. Some operations trigger computation (and consequently the I/O needed to materialize the RDD), and some don't: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
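As an illustration, here is a minimal spark-shell sketch along the same lines. The time helper below is hypothetical (similar in spirit to the show_timing used in the question), and the paths are the ones from your example:

// Hypothetical timing helper, roughly equivalent to the question's show_timing.
def time[T](block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"Time elapsed: ${(System.nanoTime() - start) / 1000} microsecs")
  result
}

// read.parquet only inspects parquet metadata to infer the schema,
// so both of these should return quickly regardless of data volume.
val dfA = time(spark.read.parquet("gs://bucket-name/tables/A/year=2018/month=4/day=5/*"))
val dfB = time(spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*"))

// count() is an action: it forces Spark to actually scan the files,
// so the large table A should take far longer here than the tiny table B.
time(dfA.count())
time(dfB.count())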

Upvotes: 1
