puhlen

Reputation: 8529

Spark read multi-line records with slidingRDD

I am trying to process a file with Spark, but my input file has a single "record" of information spread over 3 lines.

Rec1 Line1
Rec1 Line2
Rec1 Line3
Rec2 Line1
Rec2 Line2
Rec2 Line3

There is no key linking the lines of a record; the only connection is that they are three consecutive lines. There is no record separator beyond knowing that the 4th line is the start of a new record. All other questions I have seen about multi-line records seem to have some sort of obvious record separator, while I have none in this case; I have to rely on line count.

My first thought is to use the sliding function from org.apache.spark.mllib.rdd.RDDFunctions:

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.textFile("myFile.txt").sliding(3, 3)

This turns my RDD[String] into an RDD[Array[String]], where each element of the RDD is 3 consecutive lines from the file.
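To make this concrete, here is roughly how I then turn each window into a structured record (the Record case class and its field names are placeholders I made up for illustration):

import org.apache.spark.mllib.rdd.RDDFunctions._

// Placeholder record type, just for illustration.
case class Record(line1: String, line2: String, line3: String)

val records = sc.textFile("myFile.txt")
  .sliding(3, 3) // non-overlapping windows of 3 consecutive lines
  .map { case Array(a, b, c) => Record(a, b, c) }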

In some tests this looks like it works and gets me the result I want; however, I notice that the sliding function actually causes a collect during its evaluation. This has me concerned: what is it collecting? Is it the entire data file or something else? My file will be too large to collect the entire thing onto the driver.

Is sliding the best way to read this file, or is there a more efficient way to do it?

Upvotes: 1

Views: 742

Answers (1)

Tim

Reputation: 3725

The collect() call you're seeing doesn't collect all of the RDD's data, only partition summary information: roughly, per-partition counts plus the first windowSize - 1 rows of each partition, which are needed to stitch windows together across partition boundaries. Calling .sliding will cause your text file to be read an extra time to compute this information, but it won't blow out your driver memory.

I learned this from reading the code in org.apache.spark.mllib.rdd.SlidingRDD in Spark 2.0.2.

For your purpose, .sliding seems to be the best bet.
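As a quick sanity check, here's what it does on a small in-memory sample shaped like your data (the collect() at the end is only there to print the tiny demo, not something you'd do on the real file). The step argument of 3 is what makes the windows non-overlapping; the single-argument sliding(3) defaults to step 1 and would give overlapping windows instead:

import org.apache.spark.mllib.rdd.RDDFunctions._

val lines = sc.parallelize(Seq(
  "Rec1 Line1", "Rec1 Line2", "Rec1 Line3",
  "Rec2 Line1", "Rec2 Line2", "Rec2 Line3"))

lines.sliding(3, 3).collect()
// Array(Array(Rec1 Line1, Rec1 Line2, Rec1 Line3),
//       Array(Rec2 Line1, Rec2 Line2, Rec2 Line3))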

Upvotes: 1
