Reputation: 13
I'm trying to scrape some data from a website so that I can run a lexical analysis on it. I'm fairly new to data mining with R and have only used it for college work, so apologies if this is basic. To scrape the text, I used the following commands:
library(rvest)
scraping_jst <- read_html(url)
p_text <- scraping_jst %>%
  html_nodes("p") %>%
  html_text()
So I now have all the paragraphs in the p_text object. As most of these documents are dialogues, I would like to keep only the lines spoken by a certain person. Example:
I would like to find a way to select and extract only Paul's part (for example). I've tried using the strsplit() function like this:
test <- strsplit(p_text, ":")
But I'm a bit lost with the results...
Can someone help me?
Upvotes: 1
Views: 55
Reputation: 47320
Assuming your text input is a character vector of strings formatted as in your example:
text_array <- c(
"Alice: Hello",
"Paul: How are you doing ?",
"Alice: Good, you ?",
"Paul: Awesome: thx"
)
speaker <- "Paul"
parsed_lines <- sapply(text_array, function(txt) {
  # position of the first colon; irregular lines (chapter names, etc.) may need special handling
  delimiter_pos <- regexpr(":", txt)[[1]]
  speaker <- substr(txt, 1, delimiter_pos - 1)                        # text before the delimiter
  speaker_line <- trimws(substr(txt, delimiter_pos + 1, nchar(txt)))  # text after the delimiter, leading space removed
  return(list(speaker, speaker_line))
})
# reshape into a two-column data.frame, since parsed_lines holds speaker/line pairs
parsed_df <- as.data.frame(matrix(unlist(parsed_lines), ncol = 2, byrow = TRUE, dimnames = list(NULL, c("speaker", "speaker_line"))))
parsed_df
# speaker speaker_line
# 1 Alice Hello
# 2 Paul How are you doing ?
# 3 Alice Good, you ?
# 4 Paul Awesome: thx
# Paul's lines
subset(parsed_df,speaker == "Paul")
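The same split can also be done without the per-line function, because regexpr() and substr() are vectorised. The following is only a sketch under that assumption: it uses base R, the "Chapter 1" entry is a made-up example of an irregular line with no colon, and the has_speaker and paul_lines names are introduced here for illustration.

```r
# Sample input; "Chapter 1" is a hypothetical irregular line with no speaker
text_array <- c(
  "Chapter 1",
  "Alice: Hello",
  "Paul: How are you doing ?",
  "Alice: Good, you ?",
  "Paul: Awesome: thx"
)

# Position of the first colon in every string (-1 when there is none)
delimiter_pos <- as.integer(regexpr(":", text_array, fixed = TRUE))

# Keep only the lines that actually contain a "speaker:" prefix
has_speaker <- delimiter_pos > 0
lines <- text_array[has_speaker]
pos <- delimiter_pos[has_speaker]

parsed_df <- data.frame(
  speaker = substr(lines, 1, pos - 1),                           # text before the first colon
  speaker_line = trimws(substr(lines, pos + 1, nchar(lines))),   # text after it, trimmed
  stringsAsFactors = FALSE
)

# Paul's lines only
paul_lines <- parsed_df$speaker_line[parsed_df$speaker == "Paul"]
paul_lines
```

Splitting on the first colon only keeps later colons, as in "Awesome: thx", inside the spoken line.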
Upvotes: 0
Reputation: 564
Probably the best way to do this is to work line by line. Once you have each line of dialogue, with the speaker's name at the beginning, you can use str_extract() from the stringr package
str_extract(a, "(?<=Paul: ).*")
to extract everything after "Paul: " in each of those lines.
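As a rough illustration of how this behaves on a whole vector, here is a small sketch. It assumes the stringr package is installed, reuses the sample lines from the other answer as the input `a`, and introduces the names `extracted` and `paul_lines` purely for the example.

```r
library(stringr)  # assumes stringr is installed

# Sample lines, reused from the other answer in this thread
a <- c(
  "Alice: Hello",
  "Paul: How are you doing ?",
  "Alice: Good, you ?",
  "Paul: Awesome: thx"
)

# The lookbehind (?<=Paul: ) matches only text that is preceded by "Paul: "
extracted <- str_extract(a, "(?<=Paul: ).*")

# Lines without "Paul: " come back as NA, so drop them
paul_lines <- extracted[!is.na(extracted)]
paul_lines
```

Note that the pattern is not anchored to the start of the line, so a line in which another speaker quotes "Paul: " mid-sentence would also match; if that can occur in your data, filtering first on lines that begin with "Paul: " may be safer.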
Upvotes: 1