Reputation: 2124
I have read a text file in Spark using the command
val data = sc.textFile("/path/to/my/file/part-0000[0-4]")
I would like to add a new line as a header to my file. Is there a way to do that without turning the RDD into an Array?
Thank you!
Upvotes: 6
Views: 8739
Reputation: 27455
"Part" files are automatically handled as a file set.
val data = sc.textFile("/path/to/my/file") // Will read all parts.
Just add the header and write it out:
val header = sc.parallelize(Seq("...header..."))
val withHeader = header ++ data
withHeader.saveAsTextFile("/path/to/my/modified-file")
Note that because this has to read and write all the data, it will be quite a bit slower than you might intuitively expect. (After all, you're just adding a single line!) For this reason and others, you may be better off not adding this header, and instead storing the metadata (the list of columns) separately from the data.
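For example, a minimal sketch of the separate-metadata idea (the output paths and column names below are placeholders, not taken from the question):
val columns = Seq("id", "name", "score") // hypothetical column list
data.saveAsTextFile("/path/to/my/output/data") // data written out unchanged
sc.parallelize(Seq(columns.mkString(",")), 1).saveAsTextFile("/path/to/my/output/header") // tiny metadata file alongside it
Reading the header back later is just another textFile call on the header path, so the bulk data never has to be rewritten.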
Upvotes: 2
Reputation: 16308
You cannot actually control whether the new line ends up first (as a header) or not, but you can create a new singleton RDD and merge it with the existing one:
val extendedData = data ++ sc.makeRDD(Seq("my precious new line"))
so
extendedData.filter(_ startsWith "my precious").first()
should confirm that your line was added.
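If you just want a quick sanity check that exactly one line was appended, comparing counts also works (this is only an illustration, not part of the original answer):
val before = data.count()
val after = extendedData.count()
// after should equal before + 1, since the singleton RDD contributes one line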
Upvotes: 1