james

Reputation: 451

Spark read with filter

I am working with the Spark Java API. I am trying to read a file from a directory and filter some lines out. My code looks something like this:

final JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file");

//First Read....
JavaRDD<Msg> parsedMessages = textFile.map(....);

//Then Filter
JavaRDD<Msg> queryResults = parsedMessages.filter(....);

Is there a way to combine the read and filter operation into the same operation? Something like read with filter? I have a very specific requirement where I have to query a very large data set, but I get a relatively small result set back. I then have to do a series of transformations and calculations on that filtered data. I don't want to read the whole data set into memory and then filter it out. I don't have that much memory. What I would like to do instead is to filter it at read time so only the lines matching some Regex would be read in. Is this possible to do with Spark?

Upvotes: 0

Views: 5031

Answers (2)

Arnon Rotem-Gal-Oz

Reputation: 25909

Spark doesn't execute the code exactly as you write it - it goes through an optimizer. The way this code is written (read, map and filter, with no shuffling action in between), Spark will actually perform the read, the map transformation and the filter for each line as it is read - i.e. it won't need to hold all the data in memory.
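
As a rough sketch (the Msg.parse helper and the isRelevant predicate are placeholders standing in for the question's elided map and filter bodies):

// map and filter are narrow transformations with no shuffle in between,
// so Spark pipelines them: each line is read, parsed and filtered in turn,
// one record at a time per partition, instead of materializing the whole file.
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file");
JavaRDD<Msg> parsedMessages = textFile.map(line -> Msg.parse(line));         // placeholder parser
JavaRDD<Msg> queryResults = parsedMessages.filter(msg -> msg.isRelevant());  // placeholder predicate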

Upvotes: 2

zero323

Reputation: 330163

At least with SparkContext.textFile there is no such option, but it shouldn't be a problem. There is no requirement that all the data reside in memory at any point other than when collecting on the driver. Data is read in chunks, and you can decrease the size of an individual split using the minPartitions parameter.

My advice is to apply a normal filter operation as early as you can and persist the resulting RDD to avoid recomputation.
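
For example, something along these lines (the minPartitions value, the regex and the Msg.parse helper are illustrative placeholders, reusing the jsc and Msg from the question):

// Smaller splits mean each task holds less data in memory at a time.
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file", 100); // minPartitions hint

// Filter the raw lines as early as possible, then cache the small result
// so the later transformations and calculations don't re-read the whole file.
JavaRDD<Msg> queryResults = textFile
        .filter(line -> line.matches(".*ERROR.*")) // placeholder regex
        .map(line -> Msg.parse(line))              // placeholder parser
        .cache();                                  // persist to avoid recomputation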

Upvotes: 1
