Reputation: 98
I save a partitioned Parquet file to an S3 bucket from a DataFrame in Scala:
data_frame.write.mode("append").partitionBy("date").parquet("s3n://...")
When I read this partitioned data back, I'm seeing very slow performance, even for a simple group by:
val load_df = sqlContext.read.parquet(s"s3n://...").cache()
I also tried
load_df.registerTempTable("dataframe")
Any advice? Am I doing something wrong?
Upvotes: 1
Views: 3170
Reputation: 8026
You should use the S3A connector for better performance. That may be as simple as changing your URL scheme to s3a://, though you may also need the hadoop-aws and aws-sdk jars on your classpath.
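A minimal sketch of the switch, assuming a Spark 1.x setup with `sqlContext` in scope and the hadoop-aws package available (the version and the bucket path below are illustrative, not from the question):

```scala
// Launch with the S3A dependencies on the classpath, e.g.:
//   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 ...
// Then only the URL scheme changes from s3n:// to s3a://;
// the S3A connector handles reads and writes from there.
val load_df = sqlContext.read
  .parquet("s3a://my-bucket/path/to/table") // hypothetical path
  .cache()
```

If your cluster injects credentials differently, S3A also reads them from the usual Hadoop configuration keys (`fs.s3a.access.key` / `fs.s3a.secret.key`) or the AWS credential provider chain.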
Upvotes: 0
Reputation: 137
It depends on what you mean by "very slow performance".
If you have too many files in your date partitions, it will take some time to read them all.
Try reducing the granularity of the partitioning.
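One way to coarsen the granularity is to write with a derived month column instead of the raw date, so each partition holds fewer, larger files. A sketch, assuming the question's `data_frame` with a `date` column (the `month` column name and output path are illustrative):

```scala
import org.apache.spark.sql.functions.{col, date_format}

// Derive a yyyy-MM partition key from the date column, then
// partition on it: one directory per month instead of per day.
val byMonth = data_frame.withColumn("month", date_format(col("date"), "yyyy-MM"))
byMonth.write.mode("append").partitionBy("month").parquet("s3a://my-bucket/table")

// Alternatively, keep the date partitioning but cap the number of
// output files per write by reducing the partition count first:
data_frame.coalesce(8).write.mode("append").partitionBy("date").parquet("s3a://my-bucket/table")
```

Fewer, larger Parquet files per partition cut down on the per-file open and listing overhead, which is especially costly against S3.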
Upvotes: 3