Reputation: 223
I have a URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine) to scrape posts from. Some of these posts are replies that begin with the text "Originally Posted by ...". I want to scrape all the text within each post, excluding that initial "Originally Posted by" text. For example:
User df_text
A Hi, how are you ?
B This is beautiful!
C Heuwi
D Originally posted by C Heuwi
Hellou
E Hello guys
F Originally posted by A Hi, how are you ?
I am doing good
G Whats going on ?
For user F, "Originally Posted by.." is under the div.quote_container class (the child) and "I am doing good" is under blockquote.postcontent.restore, which is the parent.
Expected results:
User df_text
A Hi, how are you ?
B This is beautiful!
C Heuwi
D Hellou
E Hello guys
F I am doing good
G Whats going on ?
I tried the following code:
url<-"https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads<- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())
I also tried a few other approaches:
threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())
or
threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)
None of these worked. Please help me find a way to scrape all the post data while excluding the child class. Thanks in advance!
Upvotes: 1
Views: 1126
Reputation: 24089
This turns out to have a relatively easy solution using the xml_remove function, which is part of the xml2 package. (Older versions of rvest attach xml2 automatically; with rvest 1.0.0 and later, load xml2 explicitly.)
library(rvest)
library(xml2)  # provides xml_remove(); not attached automatically by rvest >= 1.0.0
# read the page
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
# find the parent nodes (one per post)
threads <- review %>% html_nodes("blockquote.postcontent.restore")
# find the child nodes to exclude; html_nodes() catches every quote in a post, not just the first
toremove <- threads %>% html_nodes("div.bbcode_container")
# remove the nodes (this modifies the parsed document in place)
xml_remove(toremove)
# convert the parent nodes to text
threads %>% html_text(trim = TRUE)
From the documentation for xml_remove: "Care needs to be taken when using xml_remove()". Please review, use caution and save frequently.
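To try the idea without hitting the forum, here is a minimal sketch on an inline HTML snippet. The markup is a hypothetical stand-in that only assumes the class names mentioned in the question (blockquote.postcontent.restore, div.bbcode_container, div.quote_container); the real page may nest things differently.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for two forum posts; the second one quotes the first.
html <- '
<html><body>
  <blockquote class="postcontent restore">Hi, how are you ?</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally Posted by A Hi, how are you ?</div>
    </div>
    I am doing good
  </blockquote>
</body></html>'

page <- read_html(html)

# Parent nodes: one per post
posts <- html_nodes(page, "blockquote.postcontent.restore")

# Every quoted block nested inside any post
quotes <- html_nodes(posts, "div.bbcode_container")

# Detach the quoted blocks; this changes the parsed document in place
xml_remove(quotes)

# Text of each post, now without the quoted part
df_text <- html_text(posts, trim = TRUE)
print(df_text)
```

After the removal, the second element of df_text should contain only "I am doing good". Because xml_remove() edits the document behind page, call read_html() again if the original quoted text is needed later.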
Upvotes: 2