drumkey

Reputation: 149

Spark read.parquet takes too much time

Hi, I don't understand why this code takes so much time.

val newDataDF = sqlContext.read.parquet("hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*")

No bytes are supposed to be transferred to the driver program, are they? How does read.parquet work?

What I can see from the Spark web UI is that read.parquet fires about 4000 tasks (there are a lot of Parquet files inside that folder).

Upvotes: 4

Views: 9445

Answers (1)

Silvio

Reputation: 4197

The issue most likely is the file indexing that has to occur as the first step of loading a DataFrame. You said spark.read.parquet fires off 4000 tasks, so you probably have many partition folders? Spark will get an HDFS directory listing and recursively get the FileStatus (size and splits) of all files in each folder. For efficiency, Spark indexes the files in parallel, so you want to ensure you have enough cores to make it as fast as possible. You can also be more explicit about the folders you wish to read, or define a Parquet DataSource table over the data to avoid the partition discovery each time you load it.

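For the "be more explicit" option, here's a minimal sketch (the list of days is hypothetical; use whatever subset you actually need) that passes only specific day folders to the reader instead of a wide glob:

// hedged sketch: read only the day folders you need instead of a wide glob
val days = Seq("20171101", "20171102")  // hypothetical subset of days
val paths = days.map(d => s"hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/$d")
val newDataDF = spark.read.parquet(paths: _*)

And defining the DataSource table looks like this:
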
spark.sql("""
create table mydata
using parquet
options(
  path 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711*/*'
)
""")

spark.sql("msck repair table mydata")

From this point on, when you query the data it will no longer have to do the partition discovery, but it'll still have to get the FileStatus for the files within the folders you query. If you add new partitions, you can either add the partition explicitly or force a full repair table again:

spark.sql("""
alter table mydata add partition(foo='bar')
location 'hdfs://192.168.111.70/u01/dw/prod/stage/br/ventas/201711/foo=bar'
""")

Upvotes: 3
