Reputation: 914
I have a dataset of people offering themselves for jobs. I want to retrieve from each comment some very specific sentences that I have in a .txt file. So far I haven't managed to do it properly.
score.sentiment <- function(sentences, pos.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words){
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    sentence <- tolower(sentence)
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    pos.matches <- match(words, pos.words)
    score <- pos.matches
    return(score)
  }, pos.words, .progress=.progress)
  scores.df <- data.frame(text=sentences)
  return(scores.df)
}
results <- score.sentiment(sentences = serv$service_description, pos.words)
The text file is called pos.words and it contains sentences in Spanish such as:
tengo 25 años
tengo 47 años
tengo 34 años
The other file contains a variable called services, which holds one comment per person describing their skills, their education and so on. What I'd like to do is extract each person's age from the text they have written.
Example from services file:
"Me llamo Adrián y tengo 24 años. He estudiado Data Science y me gusta trabajar en el sector tecnológico"
So from this sample I'd like to get my age (24). My idea so far has been to create a pos.words.txt file listing every possible Spanish sentence that states an age, and to match it against the comments file.
The main problems so far are that I can't write a correct function to do this, and I don't know how to make R identify whole sentences from pos.words.txt, because at the moment it treats every single word as a separate string. On top of that, the code I have posted here doesn't work (thug life...)
I'd really appreciate some help to tackle this issue!!
Thank you very much for your help!!
Adrian
Upvotes: 0
Views: 207
Reputation: 263332
This splits the input into lines and captures the last instance of "tengo <number> años" on each line:
inp <- "blah blah blah tengo 25 años more blah.
Even more blha then tengo 47 años.
Me llamo Adrián y tengo 34 años."
rl <- readLines(textConnection(inp)) # might need to split at periods
# Then use a capture group to get the digits flanked by "tengo" and "años"
gsub("^.+tengo[ ](\\d+)[ ]años.+$", "\\1", rl)
[1] "25" "47" "34"
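Note that the pattern above needs at least one character before "tengo" and one after "años", and gsub() returns the whole line unchanged when nothing matches. A minimal sketch of a variant that returns NA instead is shown below. The helper name extract_age and the sample comments are made up for illustration, and unlike the gsub() call it takes the first "tengo <number> años" in each comment rather than the last:

```r
# Hypothetical helper (not part of the original answer): pull the age out of
# "tengo <number> años" and return NA when the phrase is absent.
extract_age <- function(comments) {
  pattern <- "tengo\\s+(\\d+)\\s+años"
  # regexec() locates the first match together with its capture group;
  # regmatches() then returns c(full_match, digits), or character(0) if none.
  hits <- regmatches(comments, regexec(pattern, comments, ignore.case = TRUE))
  ages <- vapply(
    hits,
    function(m) if (length(m) >= 2) m[2] else NA_character_,
    character(1)
  )
  as.integer(ages)
}

# Made-up sample comments
comments <- c(
  "Me llamo Adrián y tengo 24 años. He estudiado Data Science.",
  "Tengo 47 años y busco trabajo.",
  "No digo mi edad."
)
extract_age(comments)
# [1] 24 47 NA

# Applied to the data from the question (assuming serv is already loaded):
# serv$age <- extract_age(serv$service_description)
```

Matching on the pattern avoids having to list every possible age sentence in pos.words.txt.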
Upvotes: 1