Onurb

Conversation mining processing

I'm trying to scrape text from a website in order to run a lexical analysis on it. I'm pretty new to data mining with R and have only used it for college work so far, so sorry if this is a beginner question. To scrape the text, I used the following commands:

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

scraping_jst <- read_html(url)
p_text <- scraping_jst %>%
  html_nodes("p") %>%
  html_text()

So now I have all the paragraphs in the p_text object. Most of these documents are dialogues, so I would like to keep only the lines spoken by a certain person. Example:

I would like to find a way to select and extract only Paul's part (for example). I've tried the strsplit() function like this:

 test <- strsplit(p_text, ":")

But I'm a bit lost with the results...

Can someone help me?

Answers (2)

moodymudskipper

Assuming your text input is a character vector whose elements are formatted as in your example:

text_array <- c(
  "Alice: Hello",
  "Paul: How are you doing ?",
  "Alice: Good, you ?",
  "Paul: Awesome: thx"
)
speaker <- "Paul"

parsed_lines <- sapply(text_array,
      function(txt){
        delimiter_pos <- regexpr(":", txt)[[1]] # position of the first colon; you may have to deal with exceptions, like chapter names and other irregular lines
        speaker <- substr(txt, 1, delimiter_pos - 1) # text before the delimiter
        speaker_line <- substr(txt, delimiter_pos + 1, nchar(txt)) # text after the delimiter
        return(list(speaker,speaker_line))
        })

parsed_df <- as.data.frame(matrix(unlist(parsed_lines),ncol=2,byrow=TRUE,dimnames=list(NULL,c("speaker","speaker_line")))) # reformat as a 2 columns data.frame, as parsed_lines was a list

parsed_df
#   speaker         speaker_line
# 1   Alice                Hello
# 2    Paul  How are you doing ?
# 3   Alice          Good, you ?
# 4    Paul         Awesome: thx

# Paul's lines
subset(parsed_df,speaker == "Paul")
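
The "irregular lines" caveat in the comment above can be handled without a per-line function, since regexpr() and substr() are both vectorised. The following is only a sketch under assumptions: the "Chapter 1" element is a made-up example of a line with no speaker, and trimming the leading space with trimws() is an optional extra, not part of the code above.

```r
# Hypothetical input: the sample lines plus one line without a speaker
text_array <- c(
  "Chapter 1",
  "Alice: Hello",
  "Paul: How are you doing ?",
  "Alice: Good, you ?",
  "Paul: Awesome: thx"
)

# Position of the first colon in each element (-1 when there is none)
pos <- regexpr(":", text_array, fixed = TRUE)

# Keep only the lines that actually contain a "speaker:" prefix
has_speaker <- pos > 0
lines <- text_array[has_speaker]
pos <- pos[has_speaker]

parsed <- data.frame(
  speaker      = substr(lines, 1, pos - 1),                     # text before the first colon
  speaker_line = trimws(substr(lines, pos + 1, nchar(lines))),  # text after it, trimmed
  stringsAsFactors = FALSE
)

# Paul's lines only
paul <- parsed$speaker_line[parsed$speaker == "Paul"]
paul
```

Because only the first colon is used as the delimiter, a colon inside the spoken text (as in "Awesome: thx") stays part of the line.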

Julian Zucker

Probably the best way to do this is to break the text down into individual lines. Once you have a character vector (here called a) with one spoken line per element, each starting with the speaker's name, you can use

library(stringr)
str_extract(a, "(?<=Paul: ).*")

to extract everything after "Paul: " in each of those lines. Lines that do not contain "Paul: " come back as NA.
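
As a rough end-to-end sketch of this approach, assuming the sample lines from the other answer as input (the `dialogue` and `paul_lines` names are made up for illustration) and assuming "Paul: " never appears inside another speaker's line:

```r
library(stringr)

# Hypothetical sample input, one dialogue line per element
dialogue <- c(
  "Alice: Hello",
  "Paul: How are you doing ?",
  "Alice: Good, you ?",
  "Paul: Awesome: thx"
)

# The lookbehind matches only text preceded by "Paul: ";
# elements without that prefix give NA
paul_raw <- str_extract(dialogue, "(?<=Paul: ).*")

# Drop the NA entries to keep Paul's lines only
paul_lines <- paul_raw[!is.na(paul_raw)]
paul_lines
```

The NA filtering step is what turns the per-line extraction into "only Paul's part"; without it the result keeps one slot per input line.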
