Reputation: 83
I need to count a varying number of specific HTML attributes on multiple webpages. I will then use that number to scrape the data I need contained in those href attributes.
library(rvest)

page_a <- "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
fighter_links <- read_html(page_a) %>%
  html_nodes("tr") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[seq(1, 1500, 3)] %>%
  na.omit()
The purpose of the above code is to read the HTML on that page and extract the links I need. This is only one of 26 webpages I need to scrape, each with a varying number of links.
The problem I have in the above code is that I want to count the number of href attributes in those specified nodes. Another solution would be to count the number of a or td nodes, if that makes a difference. I then want to use that number in place of the 1500 within .[seq(1, 1500, 3)].
I know the seq() above is incredibly crude; however, that is the only solution I can currently put into practice. I plan on using this program for a long time, and the pages I'm scraping will constantly add new information, so I don't want to hard-code anything; hence my wanting to fix the problem.
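For reference, the 26 page URLs all follow one pattern, so they can be generated rather than hard-coded (a sketch, assuming the char= query value is simply each lowercase letter, as it is for the page above):

```r
# build the 26 per-letter listing URLs from the shared pattern
base_url <- "http://www.ufcstats.com/statistics/fighters?char=%s&page=all"
pages <- sprintf(base_url, letters)

length(pages)  # 26
pages[1]       # "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
```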
Upvotes: 1
Views: 244
Reputation: 389065
You can use the length function to count the number of href attributes.
library(rvest)

page_a <- "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
fighter_links <- read_html(page_a) %>%
  html_nodes("tr td a") %>%
  html_attr("href") %>%
  # keep every third link, starting from the first
  .[seq(1, length(.), 3)] %>%
  # you can also use the logical-recycling method:
  # .[c(TRUE, FALSE, FALSE)] %>%
  na.omit()
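To see why seq(1, length(.), 3) and logical recycling select the same elements, here is a small self-contained sketch on a dummy vector (no network needed; the link strings are made up):

```r
hrefs <- c("f1", "f1", "f1", "f2", "f2", "f2", "f3", "f3", "f3")

# seq() approach: explicit indices 1, 4, 7, ...
by_seq <- hrefs[seq(1, length(hrefs), 3)]

# recycling approach: c(TRUE, FALSE, FALSE) is repeated
# to the vector's length, keeping positions 1, 4, 7, ...
by_recycle <- hrefs[c(TRUE, FALSE, FALSE)]

identical(by_seq, by_recycle)          # TRUE
identical(by_seq, c("f1", "f2", "f3")) # TRUE
```

Because both forms derive the step from the vector itself, nothing needs to be hard-coded when a page gains or loses links.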
Upvotes: 1