Reputation: 83
I need to count a varying number of specific HTML attributes on multiple webpages. I will then use that number to scrape the data I need contained in those href attributes.
library(rvest)

page_a <- "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
fighter_links <- read_html(page_a) %>%
  html_nodes("tr") %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[seq(1, 1500, 3)] %>%
  na.omit()
The purpose of the above code is to read the HTML on that page and extract the links I need. This is only one of 26 webpages I need to scrape, each with a varying number of links.
The problem I have in the above code is that I want to count the number of href attributes in those specified nodes. Another solution would be to count the number of a or td nodes, if that makes a difference. I then want to use that number in place of the 1500 within .[seq(1, 1500, 3)].
I know the seq() above is incredibly crude; however, that is the only solution I can currently put into practice. I plan on using this program for a long time, and the pages I'm scraping will constantly add new information, so I don't want to hard-code anything; hence my wanting to fix the problem.
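For reference, the 26 page URLs all follow one pattern, so they can be generated rather than hard-coded (a sketch, assuming the char= query value is simply each lowercase letter, as it is for the page above):

```r
# build the 26 per-letter listing URLs from the shared pattern
base_url <- "http://www.ufcstats.com/statistics/fighters?char=%s&page=all"
pages <- sprintf(base_url, letters)

length(pages)  # 26
pages[1]       # "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
```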
Upvotes: 1
Views: 244
Reputation: 389065
You can use the length function to count the number of href attributes.
library(rvest)

page_a <- "http://www.ufcstats.com/statistics/fighters?char=a&page=all"
fighter_links <- read_html(page_a) %>%
  html_nodes("tr td a") %>%
  html_attr("href") %>%
  # keep every third link, starting from the first
  .[seq(1, length(.), 3)] %>%
  # you can also use the logical-recycling method:
  # .[c(TRUE, FALSE, FALSE)] %>%
  na.omit()
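To see why seq(1, length(.), 3) and logical recycling select the same elements, here is a small self-contained sketch on a dummy vector (no network needed; the link strings are made up):

```r
hrefs <- c("f1", "f1", "f1", "f2", "f2", "f2", "f3", "f3", "f3")

# seq() approach: explicit indices 1, 4, 7, ...
by_seq <- hrefs[seq(1, length(hrefs), 3)]

# recycling approach: c(TRUE, FALSE, FALSE) is repeated
# to the vector's length, keeping positions 1, 4, 7, ...
by_recycle <- hrefs[c(TRUE, FALSE, FALSE)]

identical(by_seq, by_recycle)          # TRUE
identical(by_seq, c("f1", "f2", "f3")) # TRUE
```

Because both forms derive the step from the vector itself, nothing needs to be hard-coded when a page gains or loses links.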
Upvotes: 1