Reputation: 369
I'm trying to read Parquet files from two different locations, A and B. Both are Parquet datasets in GCS with approximately the same number of columns in the schema (80-90, mostly string). B is incredibly small in file size and record count (about 5 orders of magnitude smaller than A), yet it takes approximately the same time to read from GCS as A does. I'm wondering why that is.
scala> show_timing{spark.read.parquet("gs://bucket-name/tables/A/year=2018/month=4/day=5/*")}
Time elapsed: 34862525 microsecs
res5: org.apache.spark.sql.DataFrame = [a1: string, a2: string ... 84 more fields]
scala> show_timing{spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*")}
Time elapsed: 25094417 microsecs
res6: org.apache.spark.sql.DataFrame = [b1: string, b2: string ... 81 more fields]
scala> res5.count()
res7: Long = 2404444736
scala> res6.count()
res8: Long = 98787
My Spark version is 2.2. I understand that this is not much information to go on, but I'm not quite sure what else to investigate.
Upvotes: 2
Views: 6111
Reputation: 5223
The reason is that Spark is not actually reading the data when it executes read.parquet, so the read step takes roughly the same time for both locations. read is lazy: the data is accessed only when you execute an action (like count). I bet the two count operations don't take the same time!
When read is executed, it only reads the Parquet metadata to figure out the schema, so the file size doesn't matter that much.
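To see this directly, you can time the action separately from the read. A minimal sketch, reusing the paths and the show_timing helper from the question (show_timing is the asker's own utility, not part of Spark):

// Reading only lists the files and inspects Parquet footers to infer the schema,
// so A and B cost roughly the same here.
val dfA = show_timing { spark.read.parquet("gs://bucket-name/tables/A/year=2018/month=4/day=5/*") }
val dfB = show_timing { spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*") }

// The action is where the actual work happens: Spark has to scan the data now.
show_timing { dfA.count() }   // ~2.4 billion rows
show_timing { dfB.count() }   // ~99 thousand rows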
Have a look at transformations vs. actions in Spark. Some operations trigger computation (and consequently the I/O needed to materialize the RDD), some don't: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
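For illustration, a short sketch of the transformation/action split, using column names (b1, b2) taken from the schema printed above; nothing below touches GCS until the final line:

val df = spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*") // lazy: schema only
val projected = df.select("b1", "b2").filter("b1 is not null")                     // transformation: still no I/O
projected.count()                                                                  // action: triggers the actual scan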
Upvotes: 1