Selecting only ngrams based on the first word in rstudio

Question

I'm currently working on an NLP project and am using the Bible as my training data set. If you want to try it yourself, you can easily create a random corpus with rcorpus() from the ngram package:

rcorpus(nwords = 50, alphabet = letters, minwordlen = 1, maxwordlen = 6)

After processing the text file, I divide the corpus into n-grams with the ngram package:

library(ngram)
library(tm)  # provides Corpus() and DirSource()

# this is a preprocessed corpus I created earlier
bible_corpus <- Corpus(DirSource("C:/Users/XYZ/XYZ/"))

Now I process the corpus with a function I set up earlier:

corpus_sentences <- Text_To_Clean_Sentences(paste(bible_corpus, collapse=" "))

The next step is to write a function that splits the corpus into n-grams:

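# function for getting n-grams
Get_Ngrams <- function(sentence_splits, ngram_size = 2) {
    ngrams <- c()
    for (sentence in sentence_splits) {
        # strip surrounding whitespace; base R trimws() is used here
        # because Trim() is not defined in this post
        sentence <- trimws(sentence)
        # skip empty sentences and sentences with too few word separators
        if ((nchar(sentence) > 0) &&
            (sapply(gregexpr("\\W+", sentence), length) >= ngram_size)) {
            ngs <- ngram(sentence, n = ngram_size)
            ngrams <- c(ngrams, get.ngrams(ngs))
        }
    }
    return(ngrams)
}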

# making n-grams based on Get_Ngrams
n2 <- Get_Ngrams(corpus_sentences, ngram_size=2)   
n3 <- Get_Ngrams(corpus_sentences, ngram_size=3)
n4 <- Get_Ngrams(corpus_sentences, ngram_size=4)
n5 <- Get_Ngrams(corpus_sentences, ngram_size=5)

# collect all n-grams
n_all <- c(n5,n4,n3,n2)

Time to enter a search term

# enter search word
word <- 'good '

# collect all n-grams that contain the search word
matches <- c()
for (sentence in n_all) {
    # match the search word at a word boundary; the backslash in '\\<'
    # has to be doubled inside an R string
    if (grepl(paste0('\\<', word), sentence)) {
        print(sentence)
        matches <- c(matches, sentence)
    }
}

# find highest probability word
precision_match <- c()
for (a_match in matches) {
    # number of characters in front of the search word
    precision_match <- c(precision_match,nchar(strsplit(x = a_match, split = word)[[1]][[1]]))
}
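
The two loops above can also be written without explicit iteration. The following is only a minimal sketch on a small hand-made vector: toy_ngrams, toy_matches and toy_offsets are hypothetical names, not output from the real corpus. It also assumes the search word contains no regex metacharacters.

```r
# Hypothetical stand-in for n_all; the real vector comes from Get_Ngrams().
toy_ngrams <- c("good morning to you",
                "a good day",
                "nogood idea here",
                "very good indeed")
word <- "good "

# Keep the n-grams in which `word` begins at a word boundary ("\\<").
toy_matches <- toy_ngrams[grepl(paste0("\\<", word), toy_ngrams)]

# Number of characters in front of the first occurrence of `word`;
# 0 means the n-gram starts with the search word.
toy_offsets <- as.integer(regexpr(word, toy_matches, fixed = TRUE)) - 1L

print(toy_matches)
print(toy_offsets)
```

Here "nogood idea here" is dropped because "good" does not begin at a word boundary in it, and only the first remaining n-gram gets an offset of 0.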

The last step returns all n-grams that contain the search word entered above.

Now I want to remove all n-grams that don't start with the search word we entered.

For example, "matches" returns:

[1] search_word wordX wordY wordZ
[2] search_word wordY wordX wordZ
[3] wordY search_word wordX wordZ
[4] wordY wordX wordZ search_word

Of course I could manually select [1] and [2], since I can see that these two lines start with our search_word. But this isn't practical with a large number of matches. So how can I extract the n-grams starting with our search_word?
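
One possible way to do this is sketched below. It assumes base R's startsWith() is available (R 3.3.0 or later) and uses a hypothetical vector, example_matches, in place of the real results.

```r
# Hypothetical example data standing in for the real `matches` vector.
example_matches <- c("good wordX wordY wordZ",
                     "good wordY wordX wordZ",
                     "wordY good wordX wordZ",
                     "wordY wordX wordZ good ")
word <- "good "

# startsWith() compares literally (no regex) and is vectorised, so it
# returns one TRUE/FALSE per n-gram; use it as a logical index.
starts_with_word <- example_matches[startsWith(example_matches, word)]

print(starts_with_word)
```

With the variables from the code above, the same filter would read matches[startsWith(matches, word)]. Since an offset of 0 means nothing precedes the search word, matches[precision_match == 0] should select the same n-grams, provided the search word contains no regex metacharacters.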
