Reputation: 5352
I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1MB) that once processed will be more like 1KB; I would then do sc.wholeTextFiles to get started with my analysis.

How do I loop over each file (*.xml) in my directories/subdirectories, apply an operation (let's say, for the example's sake, keep the first line), and then dump the result back to HDFS (as a new file, say .xmlr)?
Upvotes: 2
Views: 3204
Reputation: 5018
I'd recommend just using sc.wholeTextFiles, preprocessing the files with transformations, and then saving them all back as a single compressed sequence file (you can refer to my guide on how to do so: http://0x0fff.com/spark-hdfs-integration/). A minimal sketch of this approach is shown below.
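For concreteness, here is a minimal Scala sketch of that approach; the input/output paths, the application object name, and the choice of GzipCodec are illustrative assumptions, and the "keep the first line" step stands in for whatever preprocessing you actually need:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object XmlPreprocess {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xml-preprocess"))

    // Read every small XML file under the input path as (path, content) pairs
    val files = sc.wholeTextFiles("hdfs:///data/xml/*/*.xml")

    // Example preprocessing: keep only the first line of each file
    val processed = files.mapValues(_.split("\n", 2)(0))

    // Save everything back as a compressed sequence file keyed by the original path
    processed.saveAsSequenceFile("hdfs:///data/xml-preprocessed", Some(classOf[GzipCodec]))

    sc.stop()
  }
}
```

Each (path, content) pair becomes one key/value record in the resulting sequence file, so the many small files end up packed into a few larger HDFS files.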
Another option would be to write a MapReduce job that processes a whole file at a time and saves the results to a sequence file, as I proposed before: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the 'Hadoop: The Definitive Guide' book; take a look at it. A sketch of the mapper side of that pattern follows.
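As a rough sketch of the per-file mapper in that pattern (rendered in Scala here for consistency with the example above): it assumes a whole-file input format such as the WholeFileInputFormat from the linked repository, and omits the driver that wires up SequenceFileOutputFormat.

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.lib.input.FileSplit

// Emits one (source file path, raw file bytes) record per input file,
// assuming an input format that delivers each file as a single BytesWritable.
class WholeFileMapper extends Mapper[NullWritable, BytesWritable, Text, BytesWritable] {
  private var filenameKey: Text = _

  override def setup(
      context: Mapper[NullWritable, BytesWritable, Text, BytesWritable]#Context): Unit = {
    // Key each output record by the path of the file this mapper is processing
    val split = context.getInputSplit.asInstanceOf[FileSplit]
    filenameKey = new Text(split.getPath.toString)
  }

  override def map(key: NullWritable, value: BytesWritable,
      context: Mapper[NullWritable, BytesWritable, Text, BytesWritable]#Context): Unit = {
    // Any per-file preprocessing would go here before the record is written out
    context.write(filenameKey, value)
  }
}
```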
In both cases you would be doing almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to handle each of these files, so both approaches work with essentially the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster running Spark.
Upvotes: 2