jpavs

Reputation: 658

How to Improve Performance with Spark Streaming Against S3

I have a Spark Streaming job that is using streamContext.textFileStream("s3://log-directory/") to listen for files and then parse them and output them as ORC files. This particular directory has a lot of files streaming in - over 40 files every 5 minutes.
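For context, here is a stripped-down sketch of what the job looks like. The parsing logic (LogRecord, the tab split) and the output bucket are placeholders, not my real code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class LogRecord(timestamp: String, message: String)

object LogsToOrc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogsToOrc")
    val ssc  = new StreamingContext(conf, Seconds(300)) // 5-minute microbatches

    // Watch the S3 prefix for newly arriving files
    val lines = ssc.textFileStream("s3://log-directory/")

    lines.foreachRDD { (rdd, batchTime) =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._
        // Placeholder parsing: split each line into a timestamp and the rest
        val parsed = rdd.map { line =>
          val parts = line.split("\t", 2)
          LogRecord(parts(0), if (parts.length > 1) parts(1) else "")
        }
        parsed.toDF()
          .write.mode("append")
          .orc(s"s3://output-bucket/orc/batch=${batchTime.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}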

However, no matter how many machines I add to my EMR cluster, it seems to stall at between 3 and 4 executors at any given time, running a maximum of 30 tasks. This makes the job fall behind quickly, because each microbatch takes about 5 minutes to parse and convert its data, and it only picks up between 4 and 10 log files at a time. Ideally, each batch would process all 40+ files within its five minutes and then move on to the next set, keeping pace with the stream.

So, my question: is there a way to increase the number of running executors? Or is there some other problem I haven't thought of that would let my job keep pace with new files? I've read a few things about S3 being quite slow with Spark, but my job keeps logging messages like 17/04/21 19:14:32 INFO FileInputDStream: Finding new files took 2135 ms. That isn't fast, but it's also not 5 minutes, so I don't think finding files is the bottleneck.

As for the environment, I currently set spark.maximizeResourceAllocation=true as a config option, which results in:

spark.default.parallelism = 160
spark.executor.cores =      8
spark.executor.memory =     10356M

This seems like it should be sufficient as well - the files currently max out at 100 MB each. I appreciate any help you can give, and I'm happy to add more details if necessary.
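For reference, here is roughly what pinning the sizing explicitly (instead of relying on maximizeResourceAllocation) would look like where the SparkConf is built; the numbers are only illustrative, not values I'm actually using:

import org.apache.spark.SparkConf

// Illustrative only: request a fixed number of executors explicitly
// rather than letting EMR's maximizeResourceAllocation decide.
val conf = new SparkConf()
  .setAppName("LogsToOrc")
  .set("spark.executor.instances", "12")
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "10g")
  .set("spark.default.parallelism", "120")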

Upvotes: 1

Views: 1125

Answers (2)

stevel

Reputation: 13430

Spark's FileInputDStream is very inefficient in the way it scans directories: it ends up scanning every file multiple times to get its timestamp and decide whether it can be excluded. This hurts against any object store, where the GET request to fetch a file's timestamp takes 100+ ms.

The best thing you can do right now (until SPARK-17159 gets in) is to keep that directory free of any old files you have already processed. They aren't needed, yet they still keep being scanned, slowing down your program the more there are.
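As a rough illustration, a small cleanup job along these lines keeps the watched prefix short; the path and the one-hour cutoff are placeholders, and it assumes anything that old really has been processed already:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CleanLandingDir {
  def main(args: Array[String]): Unit = {
    // Hypothetical cleanup step: delete files older than one hour from the
    // watched prefix so FileInputDStream has less to rescan every batch.
    val landing = new Path("s3://log-directory/")
    val fs      = FileSystem.get(new URI("s3://log-directory/"), new Configuration())
    val cutoff  = System.currentTimeMillis() - 60 * 60 * 1000L

    fs.listStatus(landing)
      .filter(status => status.isFile && status.getModificationTime < cutoff)
      .foreach(status => fs.delete(status.getPath, false))
  }
}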

Upvotes: 0

Thiago Baldim

Reputation: 7732

This is an architecture problem.

Using Spark to take the information from an ELB and write it directly to ORC is not the best solution at all.

The write process is slow with ORC because of compression and IO to disk. What I suggest is changing the process: create a job in Spark that writes your data to Kinesis or Kafka, which is a fast write. On the other side of the process, if you are using Kinesis, you can use Firehose to write the data to S3, or even Spark itself writing to S3 as a streaming process with a bigger window of time.
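As a rough sketch, the producer side of that change could look something like this; the broker address, the topic name, and the already-parsed DStream are placeholders, not something from your job:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

object KafkaSink {
  // Push already-parsed lines to a Kafka topic instead of writing ORC
  // inside the streaming job; ORC conversion can happen downstream.
  def write(parsedLines: DStream[String]): Unit = {
    parsedLines.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition, cheap compared to ORC + S3 IO
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(r => producer.send(new ProducerRecord[String, String]("parsed-logs", r)))
        producer.close()
      }
    }
  }
}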

If you need fast access to your data, I suggest using Presto or Athena from AWS to query the data directly from Kinesis, or whatever other tool fits your needs.

I hope this helps.

Upvotes: 2
