Reputation: 65
I have multiple rows of text data (different documents), and each row has around 60-70 lines of text (more than 50,000 characters). Of these, my area of interest is only 1-2 rows of data, based on keywords. I want to extract only those sentences where the keyword/group of words is present. My hypothesis is that by extracting only that piece of information, I can get better POS tagging and understand sentence context better, as I am only looking at the sentences I need. Is my understanding correct, and how can we accomplish this in R apart from using regex and full stops? That might be computationally intensive.
Eg: The Boy lives in Miami and studies in the st. Martin School. The boy has a height of 5.7" and weighs 60 kg. He has interest in the Arts and crafts; and plays basketball. ...
I just want to extract the sentence "The Boy lives in Miami and studies in the st. Martin School." based on the keyword "study" (stemmed).
Upvotes: 0
Views: 1931
Reputation: 41
For this example, I used three packages: NLP and openNLP (for sentence splitting) and SnowballC (for stemming). I did not use the tokenizers package mentioned in the other answer because I was not familiar with it. The openNLP package is an interface to the Apache OpenNLP toolkit, which is well known and widely used by the community.
First, use the code below to install the packages. If you already have them installed, skip to the next step:
## List of packages used
list.of.packages <- c("NLP", "openNLP", "SnowballC")
## Identify the packages that are not installed yet
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
## Install the missing packages
if (length(new.packages)) install.packages(new.packages)
Next, load the packages:
library(NLP)
library(openNLP)
library(SnowballC)
Next, convert the text to a String object with the NLP package's as.String function. This is necessary because the openNLP annotators work with the String class. In this example, I used the same text that you provided in your question:
example_text <- paste0("The Boy lives in Miami and studies in the St. Martin School. ",
                       "The boy has a height of 5.7 and weighs 60 kg. ",
                       "He has interest in the Arts and crafts; and plays basketball. ")
example_text <- as.String(example_text)
#output
> example_text
The Boy lives in Miami and studies in the St. Martin School. The boy has a height of 5.7 and weighs 60 kg. He has interest in the Arts and crafts; and plays basketball.
Next, we use the openNLP package to create a sentence annotator, which computes sentence boundary annotations through a Maxent sentence detector:
sent_annotator <- Maxent_Sent_Token_Annotator()
annotation <- annotate(example_text, sent_annotator)
Next, using the annotations computed over the text, we can extract the sentences:
splited_text <- example_text[annotation]
#output
> splited_text
[1] "The Boy lives in Miami and studies in the St. Martin School."
[2] "The boy has a height of 5.7 and weighs 60 kg. "
[3] "He has interest in the Arts and crafts; and plays basketball. "
Finally, we use the wordStem function from the SnowballC package, which has support for English. This function reduces a word (or a vector of words) to its stem (common root form). Then we use the grep function from base R to find the sentences that contain the stemmed keyword:
stemmed_keyword <- wordStem("study", language = "english")
## "study" stems to "studi", which also matches "studies"
sentence_index <- grep(stemmed_keyword, splited_text)
#output
> splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the St. Martin School."
Note
Note that I changed the example text you provided from "... st. Martin School." to "... St. Martin School.". If the letter "s" remained lowercase, the sentence detector would treat the period in "st." as a sentence end, and the vector of split sentences would be as follows:
> splited_text
[1] "The Boy lives in Miami and studies in the st."
[2] "Martin School."
[3] "The boy has a height of 5.7 and weighs 60 kg."
[4] "He has interest in the Arts and crafts; and plays basketball."
Consequently, when searching for your keyword in this vector, your output would be:
> splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the st."
I also tested the tokenizers package mentioned in the other answer, and it has the same problem. So notice that handling abbreviations like this is an open problem in NLP sentence-splitting tasks. However, the logic and algorithm above work correctly.
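If you cannot fix the casing in the source text by hand, one pragmatic workaround (my own suggestion, not part of the algorithm above) is to protect known abbreviations before splitting, for example with gsub. A minimal sketch, assuming "st." is the only problematic abbreviation and reusing the sent_annotator from above:
## Capitalize the abbreviation so its period is not treated as a sentence end
raw_text <- "The Boy lives in Miami and studies in the st. Martin School."
protected_text <- as.String(gsub("\\bst\\.", "St.", raw_text))
annotation <- annotate(protected_text, sent_annotator)
splited_text <- protected_text[annotation]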
I hope this helps.
Upvotes: 4
Reputation: 18440
For each document, you could first apply SnowballC::wordStem to stem the words, and then use tokenizers::tokenize_sentences to split the document into sentences. You could then use grepl to find the sentences that contain the keywords you are looking for.
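A minimal sketch of this idea (not code from this answer): since wordStem operates on words rather than raw text, the sketch splits into sentences first, stems each sentence's words, and then matches the stemmed keyword; the example document and all names are illustrative:
library(tokenizers)
library(SnowballC)

doc <- "The boy lives in Miami and studies hard. He plays basketball."

## Split the document into sentences
sentences <- tokenize_sentences(doc)[[1]]

## Stem the keyword and each sentence's words, then keep matching sentences
stemmed_keyword <- wordStem("study", language = "english")
keep <- vapply(sentences, function(s) {
  stemmed_words <- wordStem(tokenize_words(s)[[1]], language = "english")
  stemmed_keyword %in% stemmed_words
}, logical(1))
sentences[keep]
## Expected: "The boy lives in Miami and studies hard."
Note this deviates slightly from the order described above (split first, then stem), because stemming a whole document in one pass would require word tokenization that discards the sentence boundaries.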
Upvotes: 0