MattV

Reputation: 1383

rvest & parsing HTML: find a list of items, and extract specific information from each item

I'm struggling to find neat code to do the following:

Example HTML:

<div class="i-am-a-list">
    <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
    <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
    <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
    <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
    <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>

</div>

Code thus far:

# find the upper list   
coll <- read_html(doc.html) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item")

# problems here, how do I iterate over the returned divs 
# I was expecting something like
results <- coll %>% 
    do(parse_a_single_item) %>%
    rbind_all()

Is it possible to write code this tidy for such a common task? :)

Upvotes: 0

Views: 1125

Answers (1)

GGamba

Reputation: 13680

It's not especially pretty, and I may be missing a more obvious method, but you can map over the item nodes with purrr and build one data frame row per item:

library(rvest)
library(purrr)

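# Select every .item inside the list, then build one data frame row per item.
# map_df() row-binds the per-item data frames (it needs dplyr installed).
read_html(x) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item") %>% 
  map_df(~{
    # class attribute of the item div itself
    class <- html_attr(.x, 'class')
    # href of the first <a>, class of the second <a>
    a1 <- html_nodes(.x, 'a') %>% '['(1) %>% html_attr('href')
    a2 <- html_nodes(.x, 'a') %>% '['(2) %>% html_attr('class')
    # or with CSS selector
    # a1 <- html_nodes(.x, 'a:first-child') %>% html_attr('href')
    # a2 <- html_nodes(.x, 'a:nth-child(2)') %>% html_attr('class')
    # text of the <p> sub-title
    p <- html_nodes(.x, 'p') %>% html_text()
    data.frame(class, a1, a2, p)
    })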

#             class a1          a2         p
# 1   item item-one          title sub-title
# 2   item item-two      title-two sub-title
# 3 item item-three    title-three sub-title
# 4  item item-four      title-for sub-title
# 5  item item-five     title-five sub-title
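
If you'd rather keep the shape from the question (one function per item, then bind the rows), a minimal base R sketch could look like the following. The name parse_a_single_item comes from the question; its body here is only an assumption about what you want to extract, and results and items are illustrative names. Note that rbind_all() has been removed from dplyr (bind_rows() replaced it), so this sketch uses do.call(rbind, ...) instead.

```r
library(rvest)

x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'

# Hypothetical helper: turn a single .item node into a one-row data frame.
parse_a_single_item <- function(item) {
  links <- html_nodes(item, "a")
  data.frame(
    class = html_attr(item, "class"),
    a1 = html_attr(links[1], "href"),   # href of the first <a>
    a2 = html_attr(links[2], "class"),  # class of the second <a>
    p = html_text(html_node(item, "p")),
    stringsAsFactors = FALSE
  )
}

# Find all items inside the list container.
items <- read_html(x) %>% html_nodes(".i-am-a-list .item")

# Apply the helper to each node and stack the rows.
results <- do.call(rbind, lapply(items, parse_a_single_item))
```

Keeping the per-item logic in a named function makes it easier to test on a single node before running it over the whole page.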

data:

x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'
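
As a possible alternative to iterating at all: with rvest 1.0.0 or later (an assumption about your installed version), html_element() applied to a set of nodes returns exactly one result per node, so each column can be extracted in a single vectorised call. This is only a sketch reusing the selectors above; items and results are illustrative names.

```r
library(rvest)

x <- '<div class="i-am-a-list">
  <div class="item item-one"><a href=""></a><a class="title"></a><p>sub-title</p></div>
  <div class="item item-two"><a href=""></a><a class="title-two"></a><p>sub-title</p></div>
  <div class="item item-three"><a href=""></a><a class="title-three"></a><p>sub-title</p></div>
  <div class="item item-four"><a href=""></a><a class="title-for"></a><p>sub-title</p></div>
  <div class="item item-five"><a href=""></a><a class="title-five"></a><p>sub-title</p></div>
</div>'

# All item nodes inside the list container.
items <- read_html(x) %>% html_elements(".i-am-a-list .item")

# html_element() (singular) yields one match per item, so every column
# has the same length as `items`, even if an item lacks the child element.
results <- data.frame(
  class = html_attr(items, "class"),
  a1 = items %>% html_element("a:first-child") %>% html_attr("href"),
  a2 = items %>% html_element("a:nth-child(2)") %>% html_attr("class"),
  p = items %>% html_element("p") %>% html_text(),
  stringsAsFactors = FALSE
)
```

The trade-off is that this relies on every field being reachable with one selector per item; when the per-item logic gets more involved, mapping a function over the nodes is easier to follow.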

Upvotes: 2
