Reputation: 99
I'm currently working on an NLP project. As a training data set I'm using the Bible. If you want to try it yourself, you can easily create a random corpus with:
rcorpus(nwords = 50, alphabet = letters, minwordlen = 1, maxwordlen = 6)
After processing the text file, I'm dividing the corpus into n-grams with the ngram package:
library(ngram)
library(tm)   # Corpus() and DirSource() come from the tm package

# this is a preprocessed corpus I have created earlier
bible_corpus <- Corpus(DirSource("C:/Users/XYZ/XYZ/"))
Now I'm processing the corpus with a function I have set up earlier.
corpus_sentences <- Text_To_Clean_Sentences(paste(bible_corpus, collapse=" "))
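Text_To_Clean_Sentences is a helper I set up earlier; roughly, it does something like the sketch below (this body is only an approximation of the idea, not the exact implementation):
# hypothetical sketch of the cleaning helper, not the exact implementation
Text_To_Clean_Sentences <- function(text) {
  # split the text into sentences at ., ! or ?
  sentences <- unlist(strsplit(text, "[.!?]+"))
  # lower-case and keep only letters and spaces
  sentences <- tolower(sentences)
  sentences <- gsub("[^a-z ]", " ", sentences)
  # collapse repeated whitespace and trim the ends
  trimws(gsub("\\s+", " ", sentences))
}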
The next step is to write a function that splits the corpus into n-grams:
# function for getting n-grams
Get_Ngrams <- function(sentence_splits, ngram_size = 2) {
  ngrams <- c()
  for (sentence in sentence_splits) {
    # Trim() comes from helper code set up earlier (base R's trimws() does the same)
    sentence <- Trim(sentence)
    # rough check that the sentence has enough words for an n-gram of this size
    if ((nchar(sentence) > 0) && (sapply(gregexpr("\\W+", sentence), length) >= ngram_size)) {
      ngs <- ngram(sentence, n = ngram_size)
      ngrams <- c(ngrams, get.ngrams(ngs))
    }
  }
  return(ngrams)
}
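To illustrate what ngram() and get.ngrams() from the ngram package return, here is a toy example (the exact order of the returned n-grams may vary):
# quick check of the ngram package on a toy sentence
ngs <- ngram("the quick brown fox", n = 2)
get.ngrams(ngs)
# e.g. "the quick" "quick brown" "brown fox" (order may vary)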
# making n-grams based on Get_Ngrams
n2 <- Get_Ngrams(corpus_sentences, ngram_size=2)
n3 <- Get_Ngrams(corpus_sentences, ngram_size=3)
n4 <- Get_Ngrams(corpus_sentences, ngram_size=4)
n5 <- Get_Ngrams(corpus_sentences, ngram_size=5)
# collect all n-grams
n_all <- c(n5,n4,n3,n2)
Time to enter a search term:
# enter SEARCH word
# the trailing space keeps 'good' from matching the start of longer words
word <- 'good '

matches <- c()
for (sentence in n_all) {
  # '\\<' matches at the start of a word, so this finds whole-word matches
  if (grepl(paste0('\\<', word), sentence)) {
    print(sentence)
    matches <- c(matches, sentence)
  }
}
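As an aside, the whole loop can be collapsed into one call, since grepl() is vectorised over its input; this is just an equivalent shortcut, not part of the original code:
# vectorised equivalent of the matching loop above
matches <- n_all[grepl(paste0('\\<', word), n_all)]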
# find highest-probability word
precision_match <- c()
for (a_match in matches) {
  # how many characters come before the search word in this match
  precision_match <- c(precision_match, nchar(strsplit(x = a_match, split = word)[[1]][[1]]))
}
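This loop too has a vectorised equivalent, if you prefer (same result, assuming every element of matches contains the search word):
# characters before the search word in each match, without an explicit loop
precision_match <- vapply(strsplit(matches, word, fixed = TRUE),
                          function(parts) nchar(parts[[1]]),
                          integer(1))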
These steps return all n-grams that contain the search word we entered above.
Now I want to remove all n-grams which don't start with the search word we have entered.
For example "precision_match" returns:
[1] search_word wordX wordY wordZ
[2] search_word wordY wordX wordZ
[3] wordY search_word wordX wordZ
[4] wordY wordX wordZ search_word
Of course I could manually select [1] and [2], since I can see that these two lines start with our search_word. But this isn't practical with a large number of matches. So how can I extract the n-grams starting with our search_word?
Upvotes: 0
Views: 430
Reputation: 889
I ran your code and don't know why you use the last part with precision_match... It basically gives you the offset of the search word from the beginning of the string. Your question, however, seems to be tackled a step earlier (at matches).
Try
library(ngram)

set.seed(33)
test <- rcorpus(nwords = 50, alphabet = letters, minwordlen = 1, maxwordlen = 6)

Get_Ngrams <- function(sentence_splits, ngram_size = 2) {
  ngrams <- c()
  for (sentence in sentence_splits) {
    sentence <- trimws(sentence)
    if ((nchar(sentence) > 0) && (sapply(gregexpr("\\W+", sentence), length) >= ngram_size)) {
      ngs <- ngram(sentence, n = ngram_size)
      ngrams <- c(ngrams, get.ngrams(ngs))
    }
  }
  return(ngrams)
}
ngram2 <- Get_Ngrams(test,2)
ngram3 <- Get_Ngrams(test,3)
n_all <- c(ngram2,ngram3)
word <- 'ghpbw'

matches <- c()
for (sentence in n_all) {
  # find exact word-boundary match
  if (grepl(paste0('\\<', word), sentence)) {
    print(sentence)
    matches <- c(matches, sentence)
  }
}

# keep only the matches that start with the search word ("^" anchors at the start)
matches[grepl(pattern = paste0("^", word), matches)]
which returns the n-grams starting with the search word ghpbw:
[1] "ghpbw zbaiou" "ghpbw zbaiou ffrpj"
while discarding "wil ghpbw", "dxjv wil ghpbw" and "wil ghpbw zbaiou".
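Equivalently, base R's startsWith() does the same prefix filter without a regular expression (just an alternative to the grepl() line above):
# same filter, no regex needed
matches[startsWith(matches, word)]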
Upvotes: 0