Reputation: 23
Is it possible to scrape only text with specific styles with rvest?
Example HTML:
<p>Lorem ipsum <span style="font-size: 15px">dolor</span> sit amet, <span style="font-size: 15px">consetetur</span> sadipscing <span style="font-weight: 400">elitr</span>, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam.</p>
I want to scrape only the text with font-size: 15px, but not the text within other <span> tags.
One workaround I have tried is:
html %>%
html_nodes('span') %>%
str_subset('font-size: 15px')
However, it's not possible to use html_text after str_subset, as it converts the HTML nodes to character strings. Any other ideas besides erasing the remaining tags manually?
Upvotes: 2
Views: 956
Reputation: 24079
Look up the html_attr and html_attrs functions in the rvest package.
This example will find the nodes with the attribute you are looking for:
library(rvest)
html<-read_html('<p>Lorem ipsum <span style="font-size: 15px">dolor</span> sit amet, <span style="font-size: 15px">consetetur</span> sadipscing <span style="font-weight: 400">elitr</span>, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam.</p>')
nodes <- html %>% html_nodes('span')
# keep only the spans whose style attribute is exactly "font-size: 15px"
nodes[html_attr(nodes, "style") == "font-size: 15px"]
#{xml_nodeset (2)}
#[1] <span style="font-size: 15px">dolor</span>
#[2] <span style="font-size: 15px">consetetur</span>
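Since the filtered result is still a node set, html_text should work on it directly. As a rough sketch (the variable names spans, styles, keep, text_by_attr and text_by_css are made up for illustration, and the NA check is an extra guard I am assuming you want for spans that have no style attribute), the text could be extracted like this. The second variant uses a CSS attribute-substring selector instead of html_attr, which also matches when the style attribute contains other declarations:

```r
library(rvest)

html <- read_html('<p>Lorem ipsum <span style="font-size: 15px">dolor</span> sit amet, <span style="font-size: 15px">consetetur</span> sadipscing <span style="font-weight: 400">elitr</span>, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam.</p>')

# Variant 1: filter on the attribute value, then extract the text.
spans  <- html_nodes(html, "span")
styles <- html_attr(spans, "style")          # NA for spans without a style attribute
keep   <- !is.na(styles) & styles == "font-size: 15px"
text_by_attr <- html_text(spans[keep])

# Variant 2: let a CSS selector do the filtering.
# [style*="..."] matches any span whose style attribute contains the substring.
text_by_css <- html %>%
  html_nodes('span[style*="font-size: 15px"]') %>%
  html_text()

text_by_attr
#[1] "dolor"      "consetetur"
```

Note that both variants depend on the exact spelling in the markup: a page that writes font-size:15px without the space would need a different pattern.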
Upvotes: 2