Reputation: 8529
I am trying to process a file with Spark, but my input file has a single "record" of information spread over 3 lines.
Rec1 Line1
Rec1 Line2
Rec1 Line3
Rec2 Line1
Rec2 Line2
Rec2 Line3
There is no key linking the lines of a record; the only connection is that they are three consecutive lines. There is no record separator beyond knowing that every 4th line starts a new record. Other questions I have seen about multi-line records all seem to have some obvious record separator, while I have none here; I have to rely on the line count.
My first thought is to use the sliding function from org.apache.spark.mllib.rdd.RDDFunctions:

sc.textFile("myFile.txt").sliding(3,3)
This turns my RDD[String] into an RDD[Array[String]], where each element of the RDD is 3 consecutive lines from the file.
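Something like this is what I have in mind. To be clear about my assumptions here: sc is the usual spark-shell SparkContext, the import brings in the implicit conversion that adds sliding to a plain RDD, and the Record case class and its field names are just placeholders I made up for illustration.

import org.apache.spark.mllib.rdd.RDDFunctions._   // implicit conversion that adds sliding to an RDD

// Placeholder record type; the real fields depend on what each of the 3 lines contains
case class Record(line1: String, line2: String, line3: String)

val records = sc.textFile("myFile.txt")
  .sliding(3, 3)                                    // window of 3, step of 3 => non-overlapping 3-line groups
  .map { case Array(a, b, c) => Record(a, b, c) }   // each Array[String] holds one record's 3 lines

records.take(2).foreach(println)                    // quick sanity check on a small sample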
In some tests this looks like it works and gets me the result I want. However, I notice that the sliding function actually causes a collect during its evaluation. This has me concerned: what is it collecting? Is it the entire data file, or something else? My file will be too large to collect the whole thing onto the driver.
Is sliding the best way to read this file, or is there a more efficient way to do it?
Upvotes: 1
Views: 742
Reputation: 3725
The collect() call you're seeing doesn't collect all of the RDD's data, only partition summary information. Calling .sliding does cause your text file to be read an extra time to compute that information, but it won't blow out your driver's memory.
I learned this from reading the code in org.apache.spark.mllib.rdd.SlidingRDD in Spark 2.0.2.
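If it helps to see why that collect is cheap, here is a rough illustration of the idea (this is not Spark's actual code, just a sketch of the kind of per-partition summary involved): to stitch windows across partition boundaries, only the first window-size-minus-one lines of each partition need to reach the driver, so the collected data grows with the number of partitions rather than with the file size.

// Illustrative sketch only -- not the real SlidingRDD implementation.
// For sliding(3, 3) the driver would only need about 2 lines per partition,
// no matter how many records the file holds.
val lines = sc.textFile("myFile.txt")
val partitionHeads = lines
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.take(2).toArray)) }
  .collect()   // size is roughly numPartitions * 2 lines, not the whole file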
For your purpose, .sliding seems to be the best bet.
Upvotes: 1