Reputation: 3149
object NGram {
  def main(args: Array[String]): Unit = {
    // args(0) = text file, args(1) = size of the n-grams, args(2) = number of words to generate
    val n = args(1).toInt // the arguments are strings, so convert the size once
    val F = scala.io.Source.fromFile(args(0)) // take from args(0)
    for (line <- F.getLines()) {
      val words = line.split("[ ,:;.?!-]+") map (_.toLowerCase)
      var ngram: Set[String] = Set()
      // make n-grams
      for (i <- 0 to words.size - n) {
        // first make a sequence of n words
        for (j <- i until i + n) {
          ngram = ngram + words(j) // does not work; this is where I am stuck
        }
      }
    }
  }
}
I wrote an n-gram algorithm in Scala.
First, I want sequences of n strings with no duplicates (because it must work efficiently).
How can I build these sequences of n strings with map?
Upvotes: 1
Views: 3759
Reputation: 29528
Am I correct that :
There is a routine that will give you n-grams: sliding.
For example, with
val words = Seq("the", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
val trigrams = words.sliding(3).toSeq
trigrams.foreach(triGram => println(triGram.mkString(" ")))
the brown fox
brown fox jumps
fox jumps over
jumps over the
over the lazy
the lazy dog
There is a caveat: if you have only p words and want n-grams with n > p, sliding will return one p-gram (obviously not an n-gram) rather than none, so you have to check for that.
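One way to do that check is to drop any window that is shorter than n. This is only a minimal sketch: the object name NGramGuard and the helper ngrams are made up for illustration and are not part of the standard library.

```scala
object NGramGuard {
  // Hypothetical helper: returns only the windows that contain exactly n words.
  def ngrams(words: Seq[String], n: Int): Seq[Seq[String]] = {
    require(n > 0, "n must be positive")
    // sliding(n) yields a single short window when words.size < n,
    // so keep only the windows of the requested size.
    words.sliding(n).filter(_.size == n).toList
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("the", "brown", "fox")
    println(words.sliding(5).toList) // one window holding all three words
    println(ngrams(words, 5))        // no 5-grams can be built from three words
    println(ngrams(words, 2))        // the two bigrams
  }
}
```

Filtering on the window size keeps the rest of the pipeline unchanged; comparing words.size with n before calling sliding would work just as well.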
You can do toSet rather than toSeq to eliminate duplicates.
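As a small illustration of the toSet variant, here is a sketch on a word list that repeats itself. The names NGramDedup and distinctNGrams are hypothetical, and the size filter is an added safeguard against the short-window caveat.

```scala
object NGramDedup {
  // Hypothetical helper: collects the distinct n-grams of a word list into a Set.
  def distinctNGrams(words: Seq[String], n: Int): Set[Seq[String]] =
    words.sliding(n).filter(_.size == n).toSet

  def main(args: Array[String]): Unit = {
    val words = Seq("a", "b", "a", "b", "a")
    println(words.sliding(2).toList.size)  // number of bigrams, duplicates included
    println(distinctNGrams(words, 2).size) // number of distinct bigrams
  }
}
```

Keep in mind that a Set does not promise to keep the n-grams in the order they appeared in the text.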
There is one last point: you want only a certain number of n-grams (your last argument). You did not specify how you want to select them. The simple way would be a take. To avoid going through the whole list of words and take only the first count distinct ones, that would be
words.sliding(n).toStream.distinct.take(count)
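A self-contained sketch of the same idea follows. It assumes Scala 2.13 or later, where LazyList takes over from Stream (toStream is deprecated there); the names FirstDistinctNGrams and firstDistinct are hypothetical.

```scala
object FirstDistinctNGrams {
  // Hypothetical helper: the first `count` distinct n-grams, in order of appearance.
  // Assumes Scala 2.13+, where LazyList replaces the deprecated Stream.
  def firstDistinct(words: Seq[String], n: Int, count: Int): List[Seq[String]] =
    words.sliding(n)
      .filter(_.size == n) // drop the short window produced when words.size < n
      .to(LazyList)        // evaluate windows lazily, only as they are needed
      .distinct            // skip n-grams that were already seen
      .take(count)         // stop once `count` n-grams have been collected
      .toList

  def main(args: Array[String]): Unit = {
    val words = Seq("the", "lazy", "the", "lazy", "dog", "barks")
    firstDistinct(words, 2, 3).foreach(g => println(g.mkString(" ")))
  }
}
```

Because the windows are produced on demand, the words after the point where count distinct n-grams have been found should not need to be visited.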
If you want to take them at random positions, that is a different story, and maybe sliding is not the way to go.
Upvotes: 3