Reputation: 369
I'm trying to read Parquet files from two different locations, A and B. Both are Parquet datasets in GCS with approximately the same number of columns in the schema (80-90, mostly string). B is incredibly small in file size and record count (about 5 orders of magnitude smaller than A), yet it takes approximately the same time to read from GCS as A does. I'm wondering why that is.
scala> show_timing{spark.read.parquet("gs://bucket-name/tables/A/year=2018/month=4/day=5/*")}
Time elapsed: 34862525 microsecs
res5: org.apache.spark.sql.DataFrame = [a1: string, a2: string ... 84 more fields]
scala> show_timing{spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*")}
Time elapsed: 25094417 microsecs
res6: org.apache.spark.sql.DataFrame = [b1: string, b2: string ... 81 more fields]
scala> res5.count()
res7: Long = 2404444736
scala> res6.count()
res8: Long = 98787
My Spark version is 2.2. I understand that this is not much information to go on, but I'm not quite sure what else to investigate.
Upvotes: 2
Views: 6111
Reputation: 5223
The reason is that Spark is not actually reading the data when it executes read.parquet, so the read step takes roughly the same time for both locations. read is lazy: the data is accessed only when you execute an action (like count). I bet the two count operations don't take the same time!
When read is executed, it only reads the Parquet metadata to figure out the schema, so the file size doesn't matter that much.
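To see this directly, you can time the action separately from the read. A minimal sketch, reusing the paths and the show_timing helper from the question (show_timing is the asker's own utility, not part of Spark):

// Reading only lists the files and inspects Parquet footers to infer the schema,
// so A and B cost roughly the same here.
val dfA = show_timing { spark.read.parquet("gs://bucket-name/tables/A/year=2018/month=4/day=5/*") }
val dfB = show_timing { spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*") }

// The action is where the actual work happens: Spark has to scan the data now.
show_timing { dfA.count() }   // ~2.4 billion rows
show_timing { dfB.count() }   // ~99 thousand rows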
Have a look at transformations vs. actions in Spark. Some operations trigger computation (and consequently the I/O needed to materialize the RDD), some don't: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations
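For illustration, a short sketch of the transformation/action split, using column names (b1, b2) taken from the schema printed above; nothing below touches GCS until the final line:

val df = spark.read.parquet("gs://bucket-name/tables/B/year=2018/month=4/day=5/*") // lazy: schema only
val projected = df.select("b1", "b2").filter("b1 is not null")                     // transformation: still no I/O
projected.count()                                                                  // action: triggers the actual scan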
Upvotes: 1