garak

Reputation: 4803

Scala Large Text File

I'm a newbie with Scala programming.

I have to deal with an NLP task.

I'm having trouble processing a large text file in Scala.

I have read the entire text of a 100+ MB file into memory (as a single string) and have to process it (I believe processing large text files is a common task in Natural Language Processing).

The goal is to count the number of unique substrings/words in the given string (which is the entire file).

I wanted to use the "distinct" method on List, but converting the string into a list with the ".split" method raises an out-of-memory error (java.lang.OutOfMemoryError: Java heap space).

I was wondering if I could accomplish this task without lists, using only String or regular-expression methods in Scala?

Upvotes: 2

Views: 2699

Answers (3)

Jack

Reputation: 16718

Have a look at this blog post, which discusses your problem and different approaches to it.

Upvotes: 1

tgr

Reputation: 3608

I assume that you have your file in memory as a List[String], where every entry in the list is a line of the file.

// text: List[String], one element per line of the file
// Turn the list of lines into a lazily evaluated Stream
val textStream = text.toStream
// Split each line into words; the view keeps this lazy as well
val wordStream = textStream.view.flatMap(s => s.split(" "))
// Fold the words into a Stream, keeping only the first occurrence of each
val distinctWordStream = wordStream.foldLeft(Stream.empty[String])((stream, string) =>
  if (stream.contains(string)) stream else string #:: stream
)

First you create a Stream, so you don't have to deal with the whole String at once. The next step is creating a view and flat-mapping it, so every element is a single word instead of a whole line. Finally you fold the result word by word: if a word has already been seen, it is dropped. Instead of folding you could also use this line:

val wordSet = wordStream.toSet

Getting the number of distinct words is trivial at this point: just call size on the Set.
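
If the whole file doesn't even fit comfortably in memory as one String, the same idea works by reading the lines lazily. A minimal self-contained sketch (the path input.txt is just a placeholder):

import scala.io.Source

object DistinctWordCount {
  def main(args: Array[String]): Unit = {
    // getLines is lazy, so the whole file never has to sit in memory at once
    val source = Source.fromFile("input.txt")
    try {
      val distinctWords = source.getLines()
        .flatMap(_.split("\\s+")) // split each line on whitespace
        .filter(_.nonEmpty)       // drop empty tokens
        .toSet                    // deduplicate
      println(s"Distinct words: ${distinctWords.size}")
    } finally source.close()
  }
}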

Upvotes: 2

Randall Schulz

Reputation: 26486

It's certainly true that the default JVM heap size is probably going to have to be increased. I doubt greatly that using split or any other RE-based approach is going to be tractable for that large an input. Likewise you're going to see an excessive increase in memory requirements if you convert the input to a List[Char] to exploit the wonderful collections library; the size inflation will be minimally a decimal order of magnitude.
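
(On HotSpot JVMs the maximum heap is raised with the -Xmx flag, e.g. -Xmx2g; the exact value you'll need depends on your input.)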

Given the relatively simple decomposition (words separated by white-space or punctuation) I think a more prosaic solution may be necessary. Iterate imperatively over the characters of the string (but not via an implicit conversion to any kind of Seq[Char]) and find the words, dumping them into a mutable.Set[String]. That will eliminate duplicates, for one thing. Perhaps use a Buffer[Char] to accumulate the characters of each word before turning them into a String to be added to the Set[String].

Here's a cut at it:

package rrs.scribble

object BigTextNLP {
  def btWords(bt: String): collection.mutable.Set[String] = {
    val btLength = bt.length
    val wordBuffer = collection.mutable.Buffer[Char]()
    val wordSet = collection.mutable.Set[String]()

    import bt.{charAt => chr}
    import java.lang.Character.{isLetter => l}

    // Are we currently inside a word? (also guards the empty-string case)
    var inWord = btLength > 0 && l(chr(0))

    (0 until btLength) foreach { i =>
      val c = chr(i)
      val lc = l(c)

      if (inWord)
        if (lc)
          wordBuffer += c // still inside a word: accumulate the character
        else {
          wordSet += wordBuffer.mkString // word just ended: record it
          wordBuffer.clear()
          inWord = false
        }
      else
        if (lc) { // a new word begins
          inWord = true
          wordBuffer += c
        }
    }

    // Flush a word that runs to the very end of the input
    if (inWord)
      wordSet += wordBuffer.mkString

    wordSet
  }
}

In the REPL:

scala> import rrs.scribble.BigTextNLP._
import rrs.scribble.BigTextNLP._

scala> btWords("this is a sentence, maybe!")
res0: scala.collection.mutable.Set[String] = Set(this, maybe, sentence, is, a)
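
Counting the distinct words is then just a size call; continuing the same session for illustration:

scala> btWords("this is a sentence, maybe!").size
res1: Int = 5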

Upvotes: 5
