rjss

Reputation: 1013

Get href property for each row in a table using rvest

I am trying to extract all links for a table that looks similar to the following:

<!DOCTYPE html>
<html>
<body>

<table>
  <tr>
    <td>
      <a href="https://www.r-project.org/">R</a><br>
      <a href="https://www.rstudio.com/">RStudio</a>
    </td>
  </tr>
  <tr>
    <td>
      <a href="https://community.rstudio.com/">Rstudio Community</a>
    </td>
  </tr>
</table>

</body>
</html>

What I would like to end up with is a list of data frames (or vectors), where each element contains all the links for one row of the HTML table. For example, in this case the first vector would be c("https://www.r-project.org/","https://www.rstudio.com/") and the second would be c("https://community.rstudio.com/"). The main problem is that I lose the relationship between each href and its row when I do the following:

library(rvest)

web <- read_html("table.html") %>%
  html_nodes("table") %>%
  html_nodes("tr") %>%
  html_nodes("a") %>%
  html_attr("href")

Upvotes: 2

Views: 3020

Answers (1)

Andrew Gustar

Reputation: 18425

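One way would be to run a second query that uses html_node("a") instead of html_nodes("a") in the last step, which returns just the first URL in each tr. You can then use those first URLs to split the full list into groups. Note that this relies on a row's first URL not appearing again elsewhere in the table; a repeated URL would start a new group in the wrong place.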

page <- read_html("table.html") #just read the html once

web <- page %>%
  html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
  html_attr("href") #as above

web2 <- page %>%
  html_nodes("table") %>% html_nodes("tr") %>% html_node("a") %>%
  html_attr("href") #just the first url in each tr

webdf <- data.frame(web=web, #full list
                    group=cumsum(web %in% web2), #grouping indicator by tr
                    stringsAsFactors=FALSE)

webdf
                             web group
1     https://www.r-project.org/     1
2       https://www.rstudio.com/     1
3 https://community.rstudio.com/     2
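
If the goal is the list of vectors described in the question, another option is to query each tr separately, so the row relationship is never lost. The following is only a sketch under assumptions: the HTML is inlined as a string instead of read from table.html, and the names html, rows and links_by_row are made up for illustration.

```r
library(rvest)

# Inline copy of the example table from the question (assumption: the real
# page has the same table > tr > td > a structure).
html <- '<table>
  <tr><td>
    <a href="https://www.r-project.org/">R</a><br>
    <a href="https://www.rstudio.com/">RStudio</a>
  </td></tr>
  <tr><td>
    <a href="https://community.rstudio.com/">Rstudio Community</a>
  </td></tr>
</table>'

page <- read_html(html)

# One node per table row
rows <- html_nodes(page, "table tr")

# For each row, collect the href of every link inside that row only
links_by_row <- lapply(rows, function(tr) {
  html_attr(html_nodes(tr, "a"), "href")
})

links_by_row
```

Here links_by_row should hold one character vector per tr, in row order, and a row without links should come back as an empty vector. Because each row is searched on its own, this does not depend on the first URL of a row being unique. With the data frame from the answer above, split(webdf$web, webdf$group) would give a similar list.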

Upvotes: 7
