adrian1121

Reputation: 914

How can I look for specific sentences inside a text in R?

I have a dataset full of comments from people offering themselves for jobs. I want to retrieve from each comment some very specific sentences that I have in a .txt file. So far I haven't managed to do it properly.

score.sentiment <- function(sentences, pos.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words){
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    sentence <- tolower(sentence)
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    pos.matches <- match(words, pos.words)
    score <- pos.matches
    return(score)
  }, pos.words, .progress=.progress)
  scores.df <- data.frame(text=sentences)
  return(scores.df)
}
results <- score.sentiment(sentences = serv$service_description, pos.words)

The text file is called pos.words and it contains sentences in Spanish such as:

 tengo 25 años
 tengo 47 años
 tengo 34 años

The other file contains a variable called services, which holds one comment per person describing their abilities, their education and so on. What I'd like to do is get their age from the text they have written.

Example from services file:

"Me llamo Adrián y tengo 24 años. He estudiado Data Science y me gusta trabajar en el sector tecnológico"

So from this sample I'd like to get my age. My idea so far has been to create a pos.words.txt with all the possible sentences in Spanish that state an age, and to match it against the comments file.

The main problems so far are that I can't write a correct function to do it, and I don't know how to make R identify whole sentences from pos.words.txt, because at the moment it treats every single word as a separate string. In addition, the code I have posted here doesn't work (thug life...)

I'd really appreciate some help to tackle this issue!!

Thank you very much for your help!!

Adrian

Upvotes: 0

Views: 207

Answers (1)

IRTFM

Reputation: 263332

This splits the input into lines and captures the digits from the last instance of "tengo ... años" on each line:

inp <- "blah blah blah tengo 25 años more blah.
  Even more blha then tengo 47 años.
  Me llamo Adrián y tengo 34 años."
rl <- readLines(textConnection(inp))  # might need to split at periods
# Then use a capture group to get the digits flanked by "tengo" and "años"
gsub("^.+tengo[ ](\\d+)[ ]años.+$", "\\1", rl)
[1] "25" "47" "34"
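
One caveat with the gsub() approach: when a line has no match, gsub() returns the whole line unchanged rather than a missing value. A possible variant, sketched below under some assumptions, uses regexec() and regmatches() from base R so that comments without an age come back as NA. The data frame contents and the helper name extract_age are hypothetical and only illustrate the idea; the column name service_description is taken from the question. It also assumes the age is always written as "tengo <digits> años" and that the session uses a UTF-8 locale.

```r
# Hypothetical sample data; the column name mirrors the question's serv$service_description.
serv <- data.frame(
  service_description = c(
    "Me llamo Adrián y tengo 24 años. He estudiado Data Science.",
    "Soy fontanero con mucha experiencia.",
    "Hola, tengo 47 años y busco trabajo."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the first age found in each comment, or NA if none.
extract_age <- function(text) {
  pattern <- "tengo\\s+(\\d+)\\s+años"
  # regexec() records the full match and each capture group;
  # regmatches() turns that into a list with one character vector per comment.
  m <- regmatches(text, regexec(pattern, text, ignore.case = TRUE))
  vapply(
    m,
    function(x) if (length(x) >= 2) as.integer(x[2]) else NA_integer_,
    integer(1)
  )
}

serv$age <- extract_age(serv$service_description)
print(serv$age)
```

Because the pattern carries the digits in a capture group, there is no need to list every possible "tengo N años" sentence in a separate file; other phrasings of the age would need their own patterns.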

Upvotes: 1
