Reputation: 1013
I am trying to extract all links from a table that looks similar to the following:
<!DOCTYPE html>
<html>
<body>
<table>
<tr>
<td>
<a href="https://www.r-project.org/">R</a><br>
<a href="https://www.rstudio.com/">RStudio</a>
</td>
</tr>
<tr>
<td>
<a href="https://community.rstudio.com/">Rstudio Community</a>
</td>
</tr>
</table>
</body>
</html>
What I would like to end up with is a list of data frames (or vectors), where each element contains all the links from one row of the HTML table. For example, in this case the first vector would be c("https://www.r-project.org/", "https://www.rstudio.com/") and the second would be c("https://community.rstudio.com/"). The main problem I am having right now is that I cannot keep track of which row each href came from when I do the following:
library(rvest)
web <- read_html("table.html") %>%
html_nodes("table") %>%
html_nodes("tr") %>%
html_nodes("a") %>%
html_attr("href")
Upvotes: 2
Views: 3020
Reputation: 18425
One way would be to run a second search that uses html_node("a") in place of html_nodes("a"), which returns just the first URL in each tr. You could then use this to split the full list into groups.
page <- read_html("table.html") #just read the html once
web <- page %>%
html_nodes("table") %>% html_nodes("tr") %>% html_nodes("a") %>%
html_attr("href") #as above
web2 <- page %>%
html_nodes("table") %>% html_nodes("tr") %>% html_node("a") %>%
html_attr("href") #just the first url in each tr
webdf <- data.frame(web=web, #full list
group=cumsum(web %in% web2), #grouping indicator by tr
stringsAsFactors=FALSE)
webdf
                             web group
1     https://www.r-project.org/     1
2       https://www.rstudio.com/     1
3 https://community.rstudio.com/     2
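To get from the grouped data frame to the list of vectors the question asks for, something like the following sketch might work. It parses an inline copy of the question's table rather than table.html so that it runs on its own; the names html, rows, links_split and links_by_row are made up for illustration. Option 1 finishes the approach above with base R's split(). Note that the cumsum(web %in% web2) trick starts a new group whenever a URL matches the first URL of any row, so it can misgroup if such a URL is repeated elsewhere in the table. Option 2 is an alternative that queries each tr separately and so avoids that assumption.

```r
library(rvest)

# Inline copy of the question's table (assumption: stands in for table.html)
html <- '<table>
<tr><td>
<a href="https://www.r-project.org/">R</a><br>
<a href="https://www.rstudio.com/">RStudio</a>
</td></tr>
<tr><td>
<a href="https://community.rstudio.com/">Rstudio Community</a>
</td></tr>
</table>'

page <- read_html(html)
rows <- page %>% html_nodes("table") %>% html_nodes("tr")

# Option 1: build the grouped data frame as above, then split by group
web   <- rows %>% html_nodes("a") %>% html_attr("href")  # every link
web2  <- rows %>% html_node("a") %>% html_attr("href")   # first link per tr
webdf <- data.frame(web = web,
                    group = cumsum(web %in% web2),
                    stringsAsFactors = FALSE)
links_split <- split(webdf$web, webdf$group)

# Option 2: search for links inside each tr separately
links_by_row <- lapply(rows, function(tr) {
  tr %>% html_nodes("a") %>% html_attr("href")
})
```

Both results are lists with one character vector per table row. With Option 2, a row that has no links should simply yield an empty character vector, whereas Option 1 would skip it.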
Upvotes: 7