jtr13


How can I conditionally select attributes from html nodes with rvest?

Is there a way to use OR with html_attr()? In this MRE, I only want the nodes with "drink" or "food" attributes.

That is, with the following data, I'd like to do something like mydata %>% html_nodes("mynode") %>% html_attr("drink" or "food" otherwise skip), and get:

[1] "tea"    "coffee" "egg"    "toast" 

> mydata
{xml_document}
<allitems>
[1] <mynode drink="tea"/>
[2] <mynode dessert="cookie"/>
[3] <mynode drink="coffee"/>
[4] <mynode spice="pepper"/>
[5] <mynode food="egg"/>
[6] <mynode food="toast"/>

Can I do this without pulling out the drink and food attributes separately, combining the vectors, and removing NAs?


Answers (1)

Carl Boneri


I'd suggest using the xml2 package directly. rvest imports it, so it is already installed alongside rvest.

To make the example reproducible, build the HTML string with htmltools::HTML():

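library(xml2)      # read_html(), xml_find_all(), xml_attrs()
library(magrittr)  # %>% (also available via library(rvest))

a <- htmltools::HTML(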
     '<mynode drink="tea"/>
      <mynode dessert="cookie"/>
      <mynode drink="coffee"/>
      <mynode spice="pepper"/>
      <mynode food="egg"/>
      <mynode food="toast"/>')

Now, using an XPath selector, we can extract all nodes that have a food or a drink attribute.

> read_html(a) %>% xml_find_all('//*[@food or @drink]')
{xml_nodeset (4)}
[1] <mynode drink="tea"></mynode>
[2] <mynode drink="coffee"></mynode>
[3] <mynode food="egg"></mynode>
[4] <mynode food="toast"></mynode>

To get to the attribute values:

> read_html(a) %>% xml_find_all('//*[@food or @drink]') %>% 
     xml_attrs() %>% unlist(use.names = FALSE)
[1] "tea"    "coffee" "egg"    "toast"
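
One caveat: xml_attrs() returns every attribute of each matched node, so the result above is clean only because each node carries a single attribute. If the nodes might have other attributes as well, a possible variation is to select the attribute nodes themselves with XPath and read their values with xml_text(). The following is only a sketch: it wraps the nodes in the <allitems> root from the question, and the size="large" attribute is a hypothetical addition to show the difference.

```r
library(xml2)

# Same nodes as in the question, wrapped in the <allitems> root.
# The size="large" attribute is a hypothetical addition for illustration.
doc <- read_xml(
  '<allitems>
     <mynode drink="tea" size="large"/>
     <mynode dessert="cookie"/>
     <mynode drink="coffee"/>
     <mynode spice="pepper"/>
     <mynode food="egg"/>
     <mynode food="toast"/>
   </allitems>'
)

# Selecting the nodes and calling xml_attrs() also picks up "large"
nodes <- xml_find_all(doc, "//mynode[@food or @drink]")
all_attrs <- unlist(xml_attrs(nodes), use.names = FALSE)

# Selecting the attribute nodes directly keeps only drink/food values
attr_nodes <- xml_find_all(doc, "//mynode/@drink | //mynode/@food")
values <- xml_text(attr_nodes)
values
# [1] "tea"    "coffee" "egg"    "toast"
```

Here all_attrs has five elements (it includes "large"), while values contains only the four drink and food values.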

