Silvester

Reputation: 3149

How to make a string sequence by mapping in Scala?

object NGram {
    def main(args: Array[String]) {
        // args(0) = text file, args(1) = size of the n-grams, args(2) = number of words to generate
        val F = scala.io.Source.fromFile(args(0)) // read the file named by args(0)
        for (line <- F.getLines()) {
            val words = line.split("[ ,:;.?!-]+") map (_.toLowerCase)
            var ngram : Set[String] = Set()
            // build the n-grams
            for (i <- 0 to words.size - args(1)) {
                // first build a sequence of args(1) words
                for (j <- i until i + args(1)) {
                    ngram = ngram + words(j) // does not work; this is where I am stuck
                }
            }
        }
    }
}

I am writing an n-gram algorithm in Scala. My plan is to:

  1. build each sequence of n strings and check that it occurs in the original string;
  2. do this efficiently.

I want the sequences of n strings without duplicates (because it must work efficiently).

How can I build the sequences of n strings with map?

Upvotes: 1

Views: 3759

Answers (1)

Didier Dupont

Reputation: 29528

Am I correct that:

  • you have a sequence of words (it is not clear from your code whether that should be a single line or the full file)
  • an n-gram is a sequence of n consecutive words in the original sequence
  • you want a certain number of distinct n-grams?

There is a routine that will give you n-grams: sliding. For example:

val words = Seq("the", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
val triGrams = words.sliding(3).toSeq
for (triGram <- triGrams) println(triGram.mkString(" "))
the brown fox
brown fox jumps
fox jumps over
jumps over the
over the lazy
the lazy dog

There is a caveat: if you have only p words and want n-grams with n > p, sliding will return a single p-gram (obviously not an n-gram) rather than nothing. So you have to check for that.
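A minimal sketch of that check, assuming a hypothetical helper named nGrams (the name is mine, it is not a library method) that returns nothing when there are fewer than n words:

```scala
object SafeSliding {
  // Hypothetical helper: all n-grams of `words`, or an empty list
  // when the input is too short to contain even one n-gram.
  def nGrams(words: Seq[String], n: Int): List[Seq[String]] =
    if (n <= 0 || words.size < n) Nil // also avoids sliding(0), which throws
    else words.sliding(n).toList

  def main(args: Array[String]): Unit = {
    // Without the guard, sliding yields one short group.
    println(Seq("the", "fox").sliding(3).toList) // List(List(the, fox))
    // With the guard, a too-short input yields nothing.
    println(nGrams(Seq("the", "fox"), 3)) // List()
  }
}
```

The guard is one line, so it could just as well be inlined where sliding is called.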

You can use toSet rather than toSeq to eliminate duplicates.
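As a small illustration (the sample sentence and the object name are my own, not from the question), here is how toSet compares with distinct on a sequence that contains a repeated bigram. A Set makes no promise about iteration order, so distinct may be the better fit if the order of first appearance matters:

```scala
object DedupNGrams {
  val words = Seq("to", "be", "or", "not", "to", "be")

  // All bigrams in order, including the repeated "to be".
  val bigrams: List[Seq[String]] = words.sliding(2).toList

  // toSet drops the duplicate; iteration order is not guaranteed.
  val asSet: Set[Seq[String]] = words.sliding(2).toSet

  // distinct also drops the duplicate and keeps first-occurrence order.
  val inOrder: List[Seq[String]] = bigrams.distinct

  def main(args: Array[String]): Unit = {
    println(bigrams.size) // 5
    println(asSet.size)   // 4
    inOrder.foreach(b => println(b.mkString(" ")))
  }
}
```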

There is one last point: you want only a certain number of n-grams (your last argument). You did not specify how you want to select them. The simplest way is a take. To take the first count distinct ones without going through the whole list of words, that would be

words.sliding(n).toStream.distinct.take(count)
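The same idea can be wrapped in a function. This is only a sketch under two assumptions: the helper name firstDistinct is hypothetical, and it targets Scala 2.13 or later, where Iterator has its own lazy distinct so the toStream step is not needed:

```scala
object DistinctNGrams {
  // Hypothetical helper: the first `count` distinct n-grams of `words`,
  // in order of first appearance. The iterator is consumed lazily, so
  // scanning stops once `count` distinct n-grams have been found.
  def firstDistinct(words: Seq[String], n: Int, count: Int): List[Seq[String]] =
    if (n <= 0 || words.size < n) Nil // too short for even one n-gram
    else words.sliding(n).distinct.take(count).toList

  def main(args: Array[String]): Unit = {
    val words = Seq("a", "b", "a", "b", "a")
    // Bigrams are (a b), (b a), (a b), (b a); only two are distinct.
    firstDistinct(words, 2, 5).foreach(g => println(g.mkString(" ")))
  }
}
```

With the arguments described in the question, n would presumably be args(1).toInt and count args(2).toInt.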

If you want to take them at random positions, that is a different story, and sliding may not be the way to go.

Upvotes: 3
