puhlen

Reputation: 8529

Spark read multi-line records with slidingRDD

I am trying to process a file with Spark, but my input file has a single "record" of information spread over 3 lines.

Rec1 Line1
Rec1 Line2
Rec1 Line3
Rec2 Line1
Rec2 Line2
Rec2 Line3

There is no key linking the lines of a record; the only connection is that they are three consecutive lines. There is no record separator beyond knowing that the 4th line is the start of a new record. All other questions I have seen about multi-line records seem to have some sort of obvious record separator, while I have none in this case; I have to rely on line count.

My first thought is to use the sliding function from org.apache.spark.mllib.rdd.RDDFunctions:

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.textFile("myFile.txt").sliding(3, 3)

This turns my RDD[String] into an RDD[Array[String]], where each element of the RDD is 3 consecutive lines from the file.
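To make this concrete, here is roughly how I then turn each window into a structured record (the Record case class and its field names are placeholders I made up for illustration):

import org.apache.spark.mllib.rdd.RDDFunctions._

// Placeholder record type, just for illustration.
case class Record(line1: String, line2: String, line3: String)

val records = sc.textFile("myFile.txt")
  .sliding(3, 3) // non-overlapping windows of 3 consecutive lines
  .map { case Array(a, b, c) => Record(a, b, c) }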

In some tests this looks like it works and gets me the result I want; however, I notice that the sliding function actually causes a collect during its evaluation. This has me concerned: what is it collecting? Is it the entire data file or something else? My file will be too large to collect the entire thing onto the driver.

Is sliding the best way to read this file, or is there a more efficient way to do it?

Upvotes: 1

Views: 742

Answers (1)

Tim

Reputation: 3725

The collect() call you're seeing doesn't collect all of the RDD's data, only partition summary information: roughly, per-partition counts plus the first windowSize - 1 rows of each partition, which are needed to stitch windows together across partition boundaries. Calling .sliding will cause your text file to be read an extra time to compute this information, but it won't blow out your driver memory.

I learned this from reading the code in org.apache.spark.mllib.rdd.SlidingRDD in Spark 2.0.2.

For your purpose, .sliding seems to be the best bet.
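As a quick sanity check, here's what it does on a small in-memory sample shaped like your data (the collect() at the end is only there to print the tiny demo, not something you'd do on the real file). The step argument of 3 is what makes the windows non-overlapping; the single-argument sliding(3) defaults to step 1 and would give overlapping windows instead:

import org.apache.spark.mllib.rdd.RDDFunctions._

val lines = sc.parallelize(Seq(
  "Rec1 Line1", "Rec1 Line2", "Rec1 Line3",
  "Rec2 Line1", "Rec2 Line2", "Rec2 Line3"))

lines.sliding(3, 3).collect()
// Array(Array(Rec1 Line1, Rec1 Line2, Rec1 Line3),
//       Array(Rec2 Line1, Rec2 Line2, Rec2 Line3))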

Upvotes: 1
