Xpath removing children

Question

I am trying to scrape a forum using R and rvest, but I'm having trouble removing child nodes from my xml_nodeset. The HTML I'm trying to scrape is as follows:


 
  Quote:
  
    
        
            
                Originally Posted by John Doe
            
                This is the post inside the quote
        
    
  
 
  This is the post outside the quote

What I need from this piece of HTML is 'This is the post outside the quote', which is the original post. The part I don't want is the quoted post inside the element with class "alt2", 'This is the post inside the quote'.
Furthermore, there are multiple post_messages on each page, and each post_message can contain multiple quotes.
The code I'm using right now gets all the text inside each post, but it also includes the text inside the quote (which I do not want).

library(rvest)

# `link` holds the URL of the forum thread
link %>%
   read_html() %>%
   html_nodes(xpath = '//*[contains(@id, "post_message_")]') %>%
   html_text()

How can I get only the text outside the quote ('This is the post outside the quote'), preferably using xpath?


Answers (1)
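
One possible approach, sketched under assumptions rather than confirmed against the real page: select each post as in the question, remove the quote blocks from the parsed document with xml2's xml_remove(), and only then call html_text(). The sample markup below is hypothetical. It is reconstructed from the description in the question (a wrapper element holding "Quote:" and a cell with class "alt2"), so the XPath for the quote wrapper may need adjusting for the actual forum.

```r
library(rvest)
library(xml2)

# Hypothetical markup, reconstructed from the description in the question.
# The real forum page may nest the quote differently.
sample_html <- '
<html><body>
<div id="post_message_1">
  <div>
    <div>Quote:</div>
    <table><tr><td class="alt2">
      <div>Originally Posted by John Doe</div>
      <div>This is the post inside the quote</div>
    </td></tr></table>
  </div>
  This is the post outside the quote
</div>
<div id="post_message_2">A post without any quote</div>
</body></html>'

page  <- read_html(sample_html)
posts <- html_nodes(page, xpath = '//*[contains(@id, "post_message_")]')

# Assumption: every quote lives in a direct child of the post that
# contains an element with class "alt2". Select those children ...
quotes <- xml_find_all(posts, './*[descendant-or-self::*[contains(@class, "alt2")]]')

# ... and drop them from the document. This modifies `page` in place,
# so `posts` no longer contains the quoted text afterwards.
xml_remove(quotes)

# One string per post, with surrounding whitespace trimmed
outside_text <- html_text(posts, trim = TRUE)
print(outside_text)
```

Because the quotes are removed per wrapper element, this should also cope with several quotes in one post and with posts that have no quote at all. If only the bare text nodes directly under each post are needed, an XPath such as './text()' evaluated per post is an alternative, but it would also drop any formatted text (for example bold or links) outside the quote.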
