Reputation: 13
I am trying to scrape a forum using R and rvest, but I'm having trouble removing children from my xml_nodeset. The HTML I'm trying to scrape is as follows:
<div id="post_message_1234">
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table cellpadding="6" cellspacing="0" border="0" width="100%">
<tr>
<td class="alt2" style="border:1px inset">
<div>
Originally Posted by <strong>John Doe</strong>
</div>
This is the post inside the quote
</td>
</tr>
</table>
</div>
This is the post outside the quote
</div>
What I need from this piece of HTML is 'This is the post outside the quote', which is the original post. The piece I don't want is the quoted post inside the class "alt2", 'This is the post inside the quote'.
Furthermore, there are multiple post_messages on each page, and there can be multiple quotes in each post_message.
The code I'm using right now gets all the text inside each post, but it also includes the text inside the quote (which I do not want).
library(rvest)
link %>%
read_html() %>%
html_nodes(xpath = '//*[contains(@id, "post_message_")]') %>%
html_text()
How can I get only the text outside the quote ('This is the post outside the quote'), preferably using xpath?
Upvotes: 0
Views: 699
Reputation: 279
How about removing the child DIV?
link %>%
read_html() %>%
html_nodes(xpath = '//*[contains(@id, "post_message_")]/node()[not(self::div)]') %>%
html_text()
Check out this working example for IMDb, which I tested using this compiler:
read_html('http://www.imdb.com/title/tt1490017/') %>%
html_nodes(xpath = '//div[@class="originalTitle"]/node()[not(self::span)]') %>%
html_text()
I'm getting just "The LEGO Movie" as output, which is what you need.
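Since each page can hold several posts, it may help to run the filter once per post so the result has one string per post. The following is only a minimal sketch of that idea, not a tested solution for the real forum: the HTML string is a trimmed copy of the markup in the question, the second post and its id are made up for illustration, and it uses `./text()` (direct text-node children only) as a narrower variant of the `node()[not(self::div)]` filter above.

```r
library(rvest)

# Sample page modeled on the HTML in the question.
# The second post (id post_message_1235) is hypothetical.
html <- '<html><body>
<div id="post_message_1234">
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table><tr><td class="alt2">
<div>Originally Posted by <strong>John Doe</strong></div>
This is the post inside the quote
</td></tr></table>
</div>
This is the post outside the quote
</div>
<div id="post_message_1235">A second post without a quote</div>
</body></html>'

page  <- read_html(html)
posts <- html_nodes(page, xpath = '//*[contains(@id, "post_message_")]')

# For each post, keep only its direct text-node children,
# join them, and trim the surrounding whitespace.
post_text <- vapply(posts, function(post) {
  pieces <- html_text(html_nodes(post, xpath = "./text()"))
  trimws(paste(pieces, collapse = " "))
}, character(1))

print(post_text)
```

One caveat with this variant: `./text()` also drops text wrapped in inline tags (for example bold or links) inside the post itself. If that text matters, the `node()[not(self::div)]` filter keeps those elements while still skipping the quote DIV.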
Upvotes: 1