azach

Reputation: 99

Selecting only ngrams based on the first word in rstudio

I'm currently working on an NLP project and am using the Bible as the training data set. If you want to try it yourself, you can easily create a random corpus with:

rcorpus(nwords = 50, alphabet = letters, minwordlen = 1, maxwordlen = 6)

After processing the text file, I divide the corpus into n-grams with the ngram package:

library(ngram)
library(tm)  # Corpus() and DirSource() come from the tm package

# this is a preprocessed corpus I created earlier
bible_corpus <- Corpus(DirSource("C:/Users/XYZ/XYZ/"))

Now I process the corpus with a function I set up earlier:

corpus_sentences <- Text_To_Clean_Sentences(paste(bible_corpus, collapse=" "))

The next step is to write a function that splits the corpus into n-grams:

# function for getting n-grams
Get_Ngrams <- function(sentence_splits, ngram_size = 2) {
  ngrams <- c()
  for (sentence in sentence_splits) {
    # strip leading/trailing whitespace (base R)
    sentence <- trimws(sentence)
    # skip empty sentences and sentences with too few word separators
    if ((nchar(sentence) > 0) &&
        (sapply(gregexpr("\\W+", sentence), length) >= ngram_size)) {
      ngs <- ngram(sentence, n = ngram_size)
      ngrams <- c(ngrams, get.ngrams(ngs))
    }
  }
  return(ngrams)
}

# making n-grams based on Get_Ngrams
n2 <- Get_Ngrams(corpus_sentences, ngram_size=2)   
n3 <- Get_Ngrams(corpus_sentences, ngram_size=3)
n4 <- Get_Ngrams(corpus_sentences, ngram_size=4)
n5 <- Get_Ngrams(corpus_sentences, ngram_size=5)

# collect all n-grams
n_all <- c(n5,n4,n3,n2)

Time to enter a search term

# enter SEARCH Word
word <- 'good '

# collect every n-gram that contains the search word
matches <- c()
for (sentence in n_all) {
    # find exact match with double backslash and escape
    if (grepl(paste0('\\<',word), sentence)) {
        print(sentence)
        matches <- c(matches, sentence)
    }
}

# find highest probability word
precision_match <- c()
for (a_match in matches) {
    # number of characters in front of the search word
    precision_match <- c(precision_match,nchar(strsplit(x = a_match, split = word)[[1]][[1]]))
}

The last step returns all n-grams that contain the search word entered above.

Now I want to remove every n-gram that doesn't start with that search word.

For example "precision_match" returns:

[1] search_word wordX wordY wordZ
[2] search_word wordY wordX wordZ
[3] wordY search_word wordX wordZ
[4] wordY wordX wordZ search_word

Of course I could manually select [1] and [2], since I can see that these two lines start with our search_word, but that isn't practical with a large number of matches. So how can I extract the n-grams that start with our search_word?

Upvotes: 0

Views: 430

Answers (1)

CER

Reputation: 889

I ran your code and I'm not sure why you use the last part with precision_match. It basically gives you the offset (in characters) of the search word from the beginning of each string. Your question, however, seems to be solved one step earlier, at matches.
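To make that offset concrete, here is a minimal base-R sketch on made-up strings (toy_matches and the word "good" are illustrative assumptions, not output from your corpus). It computes the offset the same way as precision_match and shows that an offset of 0 identifies the n-grams that start with the search word:

```r
# Toy n-grams, made up for illustration (not taken from the corpus)
toy_matches <- c("good day to", "a good day", "very good")
word <- "good"

# Number of characters in front of the first occurrence of the search word,
# computed the same way as precision_match in the question
offsets <- vapply(
  toy_matches,
  function(m) nchar(strsplit(m, split = word)[[1]][[1]]),
  integer(1),
  USE.NAMES = FALSE
)

# An offset of 0 means the n-gram starts with the search word
starts_with_word <- toy_matches[offsets == 0]
print(offsets)
print(starts_with_word)
```

Note that strsplit() treats the search word as a regular expression, so this only behaves as expected for words without regex metacharacters.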

Try

library(ngram)
set.seed(33)
test <- rcorpus(nwords = 50, alphabet = letters, minwordlen = 1, maxwordlen = 6)

Get_Ngrams <- function(sentence_splits, ngram_size=2) {
  ngrams <- c()
  for (sentence in sentence_splits) {
    sentence <- trimws(sentence)
    if ((nchar(sentence) > 0) && (sapply(gregexpr("\\W+", sentence), length) >= 
                              ngram_size)) {
      ngs <- ngram(sentence , n=ngram_size)
      ngrams <- c(ngrams, get.ngrams(ngs))
    }
  }
  return (ngrams)
}

ngram2 <- Get_Ngrams(test,2)
ngram3 <- Get_Ngrams(test,3)
n_all <- c(ngram2,ngram3)

word <- 'ghpbw'

matches <- c()
for (sentence in n_all) {
  # find exact match with double backslash and escape
  if (grepl(paste0('\\<',word), sentence)) {
    print(sentence)
    matches <- c(matches, sentence)
  }
}



matches[grepl(pattern = paste0("^",word), matches)]

which results in the n-grams starting with the search word ghpbw:

[1] "ghpbw zbaiou" "ghpbw zbaiou ffrpj"

discarding:

[1] "wil ghpbw" "dxjv wil ghpbw" "wil ghpbw zbaiou"
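One caveat: a pattern like "^ghpbw" also matches longer words that merely begin with the search word (for example "ghpbwx"). If that matters for your data, a possible variation is to compare the first whole word of each n-gram instead. The sketch below uses only base R; starts_with is a hypothetical helper name and the toy vector is made up for illustration:

```r
# Hypothetical helper: keep the n-grams whose first whole word equals `word`
starts_with <- function(ngrams, word) {
  # drop everything from the first whitespace onward to get the first word
  first_words <- sub("\\s.*$", "", trimws(ngrams))
  ngrams[first_words == trimws(word)]
}

# Toy input, made up for illustration
toy <- c("ghpbw zbaiou", "wil ghpbw", "ghpbwx zbaiou", "ghpbw zbaiou ffrpj")
kept <- starts_with(toy, "ghpbw")
print(kept)
```

Because the comparison is vectorized, it can be applied to n_all directly, without building matches in a loop first.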

Upvotes: 0
