John Doe

Reputation: 13

XPath: removing children

I am trying to scrape a forum using R and rvest. However, I'm having trouble removing children from my xml_nodeset. The HTML I'm trying to scrape is as follows:

<div id="post_message_1234">
 <div style="margin:20px; margin-top:5px; ">
  <div class="smallfont" style="margin-bottom:2px">Quote:</div>
  <table cellpadding="6" cellspacing="0" border="0" width="100%">
    <tr>
        <td class="alt2" style="border:1px inset">
            <div>
                Originally Posted by <strong>John Doe</strong>
            </div>
                This is the post inside the quote
        </td>
    </tr>
  </table>
 </div>
  This is the post outside the quote
</div>

What I need from this piece of HTML is 'This is the post outside the quote', which is the original post. The piece I don't want is the quoted post inside the cell with class "alt2", 'This is the post inside the quote'.
Furthermore, there are multiple post_messages on each page, and there can be multiple quotes in each post_message.
The code I'm using right now gets all the text inside each post, but that also includes the text inside the quote (which I do not want).

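library(rvest)

# 'link' holds the URL of the forum page
link %>%
   read_html() %>%
   # select every element whose id contains "post_message_"
   html_nodes(xpath = '//*[contains(@id, "post_message_")]') %>%
   html_text()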

How can I get only the text outside the quote ('This is the post outside the quote'), preferably using XPath?

Upvotes: 0

Views: 699

Answers (1)

GermanC

Reputation: 279

How about excluding the child div elements?

link %>%
   read_html() %>%
   # keep every direct child node of a post except <div> elements
   html_nodes(xpath = '//*[contains(@id, "post_message_")]/node()[not(self::div)]') %>%
   html_text()

Here is a working example against IMDb that I tested in an online compiler:

library(rvest)

read_html('http://www.imdb.com/title/tt1490017/') %>%
    # keep the direct children of the title div, skipping the nested <span>
    html_nodes(xpath = '//div[@class="originalTitle"]/node()[not(self::span)]') %>%
    html_text()

I'm just getting "The LEGO Movie" as output, which is what you need.
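
To check the same idea against the HTML from the question without hitting the forum, here is a minimal sketch. It assumes that each post's own text sits directly under the post_message element and that quotes are always wrapped in child elements; the names html, posts and outside are only for illustration. It also collects the text per post, so a page with several posts gives one string for each.

```r
library(rvest)
library(xml2)

# The sample markup from the question (table attributes trimmed for brevity)
html <- '<div id="post_message_1234">
 <div style="margin:20px; margin-top:5px; ">
  <div class="smallfont" style="margin-bottom:2px">Quote:</div>
  <table>
    <tr>
        <td class="alt2" style="border:1px inset">
            <div>
                Originally Posted by <strong>John Doe</strong>
            </div>
                This is the post inside the quote
        </td>
    </tr>
  </table>
 </div>
  This is the post outside the quote
</div>'

# One node per post
posts <- read_html(html) %>%
   html_nodes(xpath = '//*[contains(@id, "post_message_")]')

# For each post, take only the text nodes that are direct children,
# glue them together and trim the surrounding whitespace
outside <- vapply(seq_along(posts), function(i) {
   parts <- xml_text(xml_find_all(posts[[i]], './text()'))
   trimws(paste(parts, collapse = ''))
}, character(1))

print(outside)
```

Using ./text() instead of node()[not(self::div)] is a slightly stricter variant: it drops every child element, not just div, so it would also discard inline tags such as bold text or links inside the post body. If those matter, the not(self::div) predicate is the better fit.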

Upvotes: 1
