Aarsh Shah
Aarsh Shah

Reputation: 91

Scala : File reading to Build an external Merge Sort

I want to implement an external Merge Sort in Scala. It is used to sort huge files which can not fit in the main memory in their entirety.

Details can be found here :- How external merge sort algorithm works?

Now, I need to read chunks of the file, sort it and write it to disk etc etc

What's the most idiomatic / functional way of reading/writing a huge file in parts ?

Upvotes: 0

Views: 595

Answers (1)

Chunked Sorting/Writing

Suppose you want to keep N numbers in memory at a time and further suppose that you are given a function which writes the N sorted numbers to a file:

val N : Int = ???

val writeToFile : Seq[Int] => Unit = ???

As indicated in your question, an Iterator can be used to only keep N numbers in RAM at a time to sort them and write them to the intermediate files:

val sourceFileName : String = ???

val sortAndWrite : Seq[Int] => Unit = 
  (_.sorted) andThen writeToFile

Source
  .fromFile(sourceFileName)
  .getLines
  .map(_.toInt)
  .grouped(N)
  .foreach(sortAndWrite)

Now you have each group of N numbers in a different file. The only thing left to do is merge the files together.

Merging

Given some reading function that returns the Iterators from each of the sub-files:

val subFiles : Iterable[Iterator[String]] = ???

We can write a function that will return a new Iterator that gets the values from each file and sorts them:

val mergeSort : Iterable[Iterator[String]] => Iterator[Int] = 
  (fileIterators) => {

    val nonEmptyFiles = fileIterators filter (_.hasNext)

    nonEmptyFiles
      .map(_.next)
      .map(_.toInt)
      .sorted
      .toIterator ++ mergeSort(nonEmptyFiles)
  }

Note: the above function will keep a single Integer in memory for each file, therefore the RAM usage is dependent on how many different files that writeToFile created.

Now just write the values to a file:

 val destinationFileName : String = ???

 val writer : Writer = new FileWriter(destinationFileName)

 mergeSort(subFiles) foreach (i => writer write i.toString)

Imperfect Sorting

One thing to note: if N is small and the source file is not sufficiently random then the solution will not produce a perfect sort. Example: suppose N = 2 and the initial list was [10,11,0,1] then the algorithm, after one pass, would produce [0,10,1,11] as the result.

Upvotes: 2

Related Questions