code_learner

Reputation: 223

Is it possible to scrape data excluding child class within html node using Rvest?

I have a URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine) from which I want to scrape the posts. Some of these posts are replies that begin with the text "Originally Posted by ...". I want to scrape all the text within the posts, excluding that initial "Originally Posted by" quote. For example, the raw text looks like this:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Originally posted by C Heuwi 
      Hellou
 E    Hello guys
 F    Originally posted by A Hi, how are you ?
      I am doing good
 G    Whats going on ?

For user F, the "Originally Posted by ..." text is under the div.quote_container class (a child element), and "I am doing good" is directly under blockquote.postcontent.restore, which is the parent element.

Expected results:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Hellou
 E    Hello guys
 F    I am doing good
 G    Whats going on ?

I tried the following code:

url<-"https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads<- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())

I tried a few other approaches too:

threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())

or

threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)

None of these worked. Please help me find a way to scrape all the post text while excluding the child element. Thanks in advance!

Upvotes: 1

Views: 1126

Answers (1)

Dave2e

Reputation: 24089

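This turns out to be relatively easy using the xml_remove function from the xml2 package. Older versions of rvest attach xml2 automatically; as far as I know, rvest 1.0.0 and later only import it, so there you should load it yourself with library(xml2).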

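library(rvest)
# xml_remove() lives in xml2; attach it explicitly in case rvest does not
library(xml2)

# read the page
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)

# find the parent nodes (one per post)
threads <- review %>% html_nodes("blockquote.postcontent.restore")
# find the child nodes to exclude; html_nodes() catches every quote in a post,
# whereas html_node() would only return the first one per post
toremove <- threads %>% html_nodes("div.bbcode_container")
# remove the quote nodes (this modifies the parsed document in place)
xml_remove(toremove)

# convert the parent nodes to text
threads %>% html_text(trim = TRUE)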
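
Since the forum page may change or be unreachable, here is a minimal offline sketch of the same removal technique. The HTML string is a hypothetical stand-in for the forum markup (the class names are taken from the question, the structure around them is assumed), so treat it as an illustration rather than a copy of the real page.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the forum page: two posts, the second quotes the first.
page <- read_html('
<html><body>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally Posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</body></html>')

# One node per post.
posts <- html_nodes(page, "blockquote.postcontent.restore")

# Collect every quote wrapper inside the posts and detach it from the document.
quotes <- html_nodes(posts, "div.bbcode_container")
xml_remove(quotes)

# The posts now hold only the reply text.
df_text <- html_text(posts, trim = TRUE)
print(df_text)
```

Because xml_remove changes the document that posts points into, the quoted text is gone from posts as well, without reselecting anything.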

From the documentation for xml_remove: "Care needs to be taken when using xml_remove()". Please review, use caution and save frequently.
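
If modifying the parsed document is a concern, one possible alternative is to leave it untouched and select only the text nodes that are not inside a quote. The sketch below assumes the quote is always wrapped in div.bbcode_container; the sample HTML and the helper name own_text are hypothetical, made up for illustration.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the forum page: two posts, the second quotes the first.
page <- read_html('
<html><body>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally Posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</body></html>')

posts <- html_nodes(page, "blockquote.postcontent.restore")

# Hypothetical helper: gather the text nodes of one post, skipping any that
# have a div.bbcode_container ancestor, then join and trim them.
own_text <- function(post) {
  pieces <- xml_find_all(
    post,
    ".//text()[not(ancestor::div[contains(@class, 'bbcode_container')])]"
  )
  trimws(paste(xml_text(pieces), collapse = ""))
}

df_text <- vapply(posts, own_text, character(1))
print(df_text)
```

Nothing is removed here, so the quoted text is still available in page if it is needed later.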

Upvotes: 2
