Mahdi

Reputation: 817

Filestreams limitations in Spark Streaming

I need to develop a streaming application which reads session logs from several sources. The batch interval could be on the order of 5 minutes.

The problem is that the files I get in each batch vary enormously in size. In one batch I may get a file of about 10 MB, and in another batch I may get files around 20 GB.

I want to know if there is any approach to handle this. Is there any limit on the size of the RDDs a file stream can generate for each batch?

Can I limit Spark Streaming to read only a fixed amount of data into the RDD in each batch?

Upvotes: 0

Views: 535

Answers (1)

vgunnu

Reputation: 844

As far as I know, there is no direct way to limit that. Which files are picked up is controlled by the private isNewFile function in the file stream implementation (FileInputDStream). Based on that code I can think of one workaround.

Use the filter function to limit the number of files read per batch: for any files beyond the first 10, return false and use the touch command to update the file's timestamp so that it is considered again in the next window.

import org.apache.hadoop.fs.Path

var globalCounter = 10

val filterF: Path => Boolean = (file: Path) => {
  globalCounter -= 1
  if (globalCounter >= 0) {
    true  // consider only the first 10 files
  } else {
    // touch the file so that its timestamp is updated and it is
    // considered again in the next window (see the sketch below)
    false
  }
}
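For completeness, here is a rough sketch of how such a filter could be plugged into fileStream, together with a "touch" done through the Hadoop FileSystem API. The directory path, the 5-minute batch interval, the app name, and the touch helper are assumptions for illustration, not something from the question.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("session-logs"), Minutes(5))

// "Touch" a rejected file via the Hadoop FileSystem API so that the
// next window sees it as new again.
val fs = FileSystem.get(new Configuration())
def touch(file: Path): Unit = {
  val now = System.currentTimeMillis()
  fs.setTimes(file, now, now)  // update modification and access time
}

// fileStream overload that takes a Path => Boolean filter.
val sessionLogs = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///data/session-logs",   // assumed input directory
  filterF,                       // the filter defined above
  newFilesOnly = true
).map(_._2.toString)

ssc.start()
ssc.awaitTermination()

Note that for this to keep limiting later batches, globalCounter would have to be reset at the start of every batch interval; as written it only caps the first one.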

Upvotes: 0
