Reputation: 21
I'm trying to open a text file, process each line and store the result in a multidimensional array.
My input file contains:
1 1 3 2
2 2.2 3 1.8
3 3 1.2 2.5
and I want to create a 3x4 array like this:
(1, 1, 3, 2)
(2, 2.2, 3, 1.8)
etc
My code is:
for (line <- Source.fromFile(inputFile).getLines) {
var counters = line.split("\\s+")
sc.parallelize(counters).saveAsTextFile(outputFile)
}
I am trying to save the results to a text file, but the first thing I get at runtime is this exception:
apache.hadoop.mapred.FileAlreadyExistsException:
Output directory file:/home/user/Desktop/output.txt already exists
I guess it is related to parallelize, but that was the only way I found to save an array.
Also, what is stored is not what I want. The file has two partition files that contain:
part1:
1
1
part2:
3
2
How can I create a multidimensional array from one-dimensional arrays, and how can I save it to a text file?
Upvotes: 0
Views: 292
Reputation: 37822
You're creating a separate RDD (and saving it to file) for each line, instead of one RDD for the entire file. Also, since you're using Spark (see disclaimers) to write the file - you'd benefit from also using it to read it. Here's how you can fix it:
sc.textFile(inputFile)
.map(_.split("\\s+").mkString(",")) // if you want result to be comma-delimited
.repartition(1) // if you want to make sure output has one partition (file)
.saveAsTextFile(outputFile)
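The per-line function passed to map can be checked without a SparkContext. Below is a minimal sketch in plain Scala that applies the same split/mkString step to the three sample rows from the question; the names SplitDemo, sampleLines and formatted are made up for illustration and are not part of the original code.

```scala
object SplitDemo {
  // Sample rows copied from the question; a stand-in for the lines Spark would read.
  val sampleLines: Seq[String] = Seq("1 1 3 2", "2 2.2 3 1.8", "3 3 1.2 2.5")

  // Same function as in the map call above: split on whitespace, rejoin with commas.
  val formatted: Seq[String] = sampleLines.map(_.split("\\s+").mkString(","))

  def main(args: Array[String]): Unit =
    formatted.foreach(println)
}
```

Each input line produces exactly one output line, so the row structure of the file is kept, which is what the per-line parallelize call in the question loses.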
A few disclaimers though:
- If your input is small enough to read locally (as you do with fromFile) - why do you need Spark? Spark should usually be used for data that is too large for a single file / single process's memory to handle.
- Make sure outputFile doesn't exist before you run this - otherwise you'll see the same exception (Spark is careful not to overwrite your data, so it fails if the output file, which is actually a folder, already exists).
Upvotes: 1
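On the first disclaimer: for a file as small as the one in the question, the two-dimensional array can probably be built and saved without Spark at all. The following is only a sketch under that assumption; MatrixIO, parse, readMatrix and writeMatrix are hypothetical names, and it assumes every token in the file parses as a Double.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.io.Source

object MatrixIO {
  // Turn whitespace-separated lines into one Array[Double] per row.
  // Blank lines are skipped; a non-numeric token throws NumberFormatException.
  def parse(lines: Iterator[String]): Array[Array[Double]] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.split("\\s+").map(_.toDouble)).toArray

  // Read the whole file into a 2D array, closing the source afterwards.
  def readMatrix(path: String): Array[Array[Double]] = {
    val source = Source.fromFile(path)
    try parse(source.getLines()) finally source.close()
  }

  // Write one comma-delimited row per line into a single plain file.
  def writeMatrix(matrix: Array[Array[Double]], path: String): Unit = {
    val text = matrix.map(_.mkString(",")).mkString("\n")
    Files.write(Paths.get(path), text.getBytes(StandardCharsets.UTF_8))
  }
}
```

Something like MatrixIO.writeMatrix(MatrixIO.readMatrix(inputFile), outputFile) would then produce one ordinary file rather than a folder of partitions. Two differences from the Spark version are worth keeping in mind: values are stored as Doubles, so 1 is written back as 1.0, and Files.write with no extra options replaces an existing file silently instead of failing.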