user3121051

Reputation: 21

How to create a multidimensional array and save it in a text file in Scala

I'm trying to open a text file, process each line and store the result in a multidimensional array.

My input file contains:

1 1 3 2  
2 2.2 3 1.8  
3 3 1.2 2.5   

and I want to create a 3x4 array like this:

(1, 1, 3, 2)  
(2, 2.2, 3, 1.8)  
etc

My code is:

for (line <- Source.fromFile(inputFile).getLines) {
  var counters = line.split("\\s+")
  sc.parallelize(counters).saveAsTextFile(outputFile)
}

I am trying to save the results in a text file, but I get an exception at runtime:

apache.hadoop.mapred.FileAlreadyExistsException:
  Output directory file:/home/user/Desktop/output.txt already exists

I guess the problem is related to parallelize, but that was the only way I found to save an array.

Also, what gets stored is not what I want. The output directory contains two partition files:

part1:

1  
1  

part2:

3  
2  

How can I create a multidimensional array from one dimension arrays and how can I save it in a text file?

Upvotes: 0

Views: 292

Answers (1)

Tzach Zohar

Reputation: 37822

You're creating a separate RDD (and saving it to a file) for each line, instead of one RDD for the entire file. Also, since you're using Spark to write the file (see disclaimers below), you'd benefit from using it to read the file as well. Here's how you can fix it:

sc.textFile(inputFile)
  .map(_.split("\\s+").mkString(",")) // if you want result to be comma-delimited
  .repartition(1) // if you want to make sure output has one partition (file)
  .saveAsTextFile(outputFile)

A few disclaimers though:

  • If the file is indeed relatively small (so you can load it using fromFile), why do you need Spark at all? Spark should usually be used for data that is too large for a single file / a single process's memory to handle
  • You'll have to make sure outputFile doesn't exist before you run this; otherwise you'll get the same exception (Spark is careful not to overwrite your data, so it fails if the output file, which is actually a folder, already exists)
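Following the first disclaimer: if the input really is this small, plain Scala can build the multidimensional array and write it out without Spark at all. Here's a minimal sketch (file names `inputFile` and `outputFile` are placeholders for illustration; it writes the sample data first just so the snippet is self-contained):

```scala
import scala.io.Source
import java.io.{File, PrintWriter}

// Placeholder file names for this sketch.
val inputFile = "input.txt"
val outputFile = "output.txt"

// Write the sample input so the example runs on its own.
val sample = new PrintWriter(new File(inputFile))
sample.write("1 1 3 2\n2 2.2 3 1.8\n3 3 1.2 2.5\n")
sample.close()

// Build the multidimensional array: one inner Array[Double] per line.
val matrix: Array[Array[Double]] =
  Source.fromFile(inputFile).getLines()
    .map(_.trim.split("\\s+").map(_.toDouble))
    .toArray

// Save each row as one comma-delimited line.
val out = new PrintWriter(new File(outputFile))
matrix.foreach(row => out.println(row.mkString(",")))
out.close()
```

This gives you a real 3x4 `Array[Array[Double]]` you can index as `matrix(row)(col)`, and a single output file with no partition directories to worry about.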

Upvotes: 1
