Erick Díaz

Reputation: 98

Slow performance reading Parquet files from S3 with Scala in Spark

I save a partitioned file to an S3 bucket from a DataFrame in Scala:

data_frame.write.mode("append").partitionBy("date").parquet("s3n://...")

When I read these partitioned files back I'm experiencing very slow performance, even though I'm only doing a simple group by:

val load_df = sqlContext.read.parquet(s"s3n://...").cache()

I also tried load_df.registerTempTable("dataframe").

Any advice? Am I doing something wrong?
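For reference, a minimal sketch of the kind of aggregation described. The grouping column and the bucket path are assumptions for illustration; "date" is used because it is the partition column:

// Hypothetical reproduction of the slow query (Spark 1.x / sqlContext API).
// The bucket path and the grouping column "date" are placeholders.
val load_df = sqlContext.read.parquet("s3n://your-bucket/path").cache()
val counts = load_df.groupBy("date").count()
counts.show()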

Upvotes: 1

Views: 3170

Answers (2)

C4stor

Reputation: 8026

You should use the S3A driver for better performance. That may be as simple as changing your URL scheme to s3a://, though you may also need the hadoop-aws and aws-java-sdk JARs on your classpath.
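A minimal sketch of what the switch might look like, assuming Spark 1.x with a SparkContext named sc; the bucket path and credentials are placeholders, and fs.s3a.access.key / fs.s3a.secret.key are the standard Hadoop S3A configuration keys:

// Requires hadoop-aws and a matching aws-java-sdk JAR on the classpath.
// Credentials and path below are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Same read as before, but over s3a:// instead of s3n://
val load_df = sqlContext.read.parquet("s3a://your-bucket/path").cache()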

Upvotes: 0

slovit

Reputation: 137

It depends on what you mean by "very slow performance".

If you have too many files in your date partitions, it will take some time to read them all, since each file adds its own listing and open overhead on S3.

Try to reduce the granularity of the partition, as sketched below.
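One way to do that is to repartition by the partition column before writing, so each date directory gets a single Parquet file instead of one file per task. A sketch, assuming Spark 1.6+ (where repartition accepts column expressions); the bucket path is a placeholder:

// Collapse each date's data into one partition before the write,
// producing one Parquet file per date directory instead of many small ones.
data_frame
  .repartition(data_frame("date"))
  .write
  .mode("append")
  .partitionBy("date")
  .parquet("s3a://your-bucket/path")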

Upvotes: 3
