Conversation mining processing

Question

I'm trying to scrape text from a website so that I can run a lexical analysis on it. I'm fairly new to data mining with R and have only used it for college work, so apologies if this is a basic question. To scrape the text, I used the following code:

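library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

scraping_jst <- read_html(url)
p_text <- scraping_jst %>%
  html_nodes("p") %>%
  html_text()  # extract the text of every <p> node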

So I now have all the paragraphs in the p_text object. Most of these documents are dialogues, and I would like to keep only the lines spoken by a certain person. For example:

Alice: Hello
Paul: How are you doing ?

I would like to find a way to select and extract only Paul's lines (for example). I've tried using the strsplit() function like this:

 test <- strsplit(p_text, ":")

But I'm a bit lost with the results...

Can someone help me ?

Julian Zucker · Accepted Answer

Probably the best way to do this is to work line by line. Once you have one element per line of dialogue, with the speaker's name at the beginning, you can use str_extract() from the stringr package:

str_extract(a, "(?<=Paul: ).*")

to extract everything after "Paul: " in each of those lines. Lines that do not contain "Paul: " come back as NA, so drop those afterwards.
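As a rough sketch of how this could fit together, the snippet below applies the same look-behind pattern to a small made-up vector standing in for p_text (the sample lines and the names is_paul, paul_lines and parts are only for illustration). It assumes each element holds exactly one line of the form "Name: utterance"; if a single paragraph contains several lines, it would need to be split first.

```r
library(stringr)

# Hypothetical sample data standing in for the scraped p_text vector
p_text <- c("Alice: Hello",
            "Paul: How are you doing ?",
            "Alice: Fine, thanks.",
            "Paul: Good to hear.")

# Keep only the lines that start with the speaker's name
is_paul <- str_detect(p_text, "^Paul: ")

# The look-behind drops the "Paul: " prefix and keeps the rest of the line
paul_lines <- str_extract(p_text[is_paul], "(?<=Paul: ).*")
print(paul_lines)

# Alternative: split every line into speaker and utterance in one pass.
# str_match() returns a matrix: full match, then one column per capture group.
parts <- str_match(p_text, "^([^:]+):\\s*(.*)$")
dialogue <- data.frame(speaker = parts[, 2], text = parts[, 3])
print(dialogue[dialogue$speaker == "Paul", "text"])
```

The second form may be easier to work with if you later want to compare several speakers, since it keeps the name alongside the text instead of hard-coding "Paul" into the pattern.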
