james

Reputation: 451

Spark read with filter

I am working with the Spark Java API. I am trying to read a file from a directory and filter some lines out. My code looks something like this:

final JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file");

//First Read....
JavaRDD<Msg> parsedMessages = textFile.map(....);

//Then Filter
JavaRDD<Msg> queryResults = parsedMessages.filter(....);

Is there a way to combine the read and filter operation into the same operation? Something like read with filter? I have a very specific requirement where I have to query a very large data set, but I get a relatively small result set back. I then have to do a series of transformations and calculations on that filtered data. I don't want to read the whole data set into memory and then filter it out. I don't have that much memory. What I would like to do instead is to filter it at read time so only the lines matching some Regex would be read in. Is this possible to do with Spark?

Upvotes: 0

Views: 5031

Answers (2)

Arnon Rotem-Gal-Oz

Reputation: 25909

Spark doesn't execute the code exactly as you write it - it goes through an optimizer. The way this code is written (read, map and filter, with no shuffling action in between), Spark will actually perform the read, the map transformation and the filter for each line as it is read - i.e. it won't need to hold all the data in memory.
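
As a rough sketch (the Msg.parse helper and the isRelevant predicate are placeholders standing in for the question's elided map and filter bodies):

// map and filter are narrow transformations with no shuffle in between,
// so Spark pipelines them: each line is read, parsed and filtered in turn,
// one record at a time per partition, instead of materializing the whole file.
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file");
JavaRDD<Msg> parsedMessages = textFile.map(line -> Msg.parse(line));         // placeholder parser
JavaRDD<Msg> queryResults = parsedMessages.filter(msg -> msg.isRelevant());  // placeholder predicate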

Upvotes: 2

zero323

Reputation: 330163

At least with SparkContext.textFile there is no such option, but it shouldn't be a problem. There is no requirement that all the data reside in memory at any point other than when collecting on the driver. Data is read in chunks, and you can decrease the size of an individual split using the minPartitions parameter.

My advice is to apply a normal filter operation as early as you can and persist the resulting RDD to avoid recomputation.
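
For example, something along these lines (the minPartitions value, the regex and the Msg.parse helper are illustrative placeholders, reusing the jsc and Msg from the question):

// Smaller splits mean each task holds less data in memory at a time.
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file", 100); // minPartitions hint

// Filter the raw lines as early as possible, then cache the small result
// so the later transformations and calculations don't re-read the whole file.
JavaRDD<Msg> queryResults = textFile
        .filter(line -> line.matches(".*ERROR.*")) // placeholder regex
        .map(line -> Msg.parse(line))              // placeholder parser
        .cache();                                  // persist to avoid recomputation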

Upvotes: 1
