av abhishiek

Reputation: 667

Retrieving table data from html doc in R

I am trying to retrieve brand data for the most powerful brands from http://www.forbes.com/powerful-brands/list/#tab:rank. When my first attempts with getURL and htmlParse failed, I realized that the table data is loaded from a different link. To make things easier, I downloaded the HTML page and tried to retrieve the data from the local copy.
I first tried:

library(XML)
library(RCurl)
library(ggplot2)
forbes <- readHTMLTable("forbes.html",header = TRUE,as.data.frame = TRUE)
forbes

Now when I display forbes I get a list. I had thought I would get a data frame instead.

I checked the list and found the data for the top 10 brands in forbes$the_list, but not the data for the remaining companies, i.e. those beyond the top 10.

How can I retrieve all the tabular data from the Forbes page, and how can I convert it to a data frame for further manipulation?

Please let me know if you need any further info.

Upvotes: 1

Views: 106

Answers (1)

Emmanuel Hamel

Reputation: 2223

I was able to extract the whole table with the following code:

library(RSelenium)
library(rvest)  # provides read_html(), html_table() and the %>% pipe used below

url <- "http://www.forbes.com/powerful-brands/list/#tab:rank"
# start a Selenium server in Docker; shell() is Windows-only, use system() on other platforms
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)

# scroll through the page repeatedly so that all of the table rows get loaded
for(i in 1 : 30)
{
  print(i)
  for(j in 1 : 30)
  {
    js_Script <- paste0("scroll(", (i - 1) * 300, ",", (j - 1) * 300, ");")
    remDr$executeScript(js_Script)
  }
}

Sys.sleep(5)
htmltxt <- remDr$getPageSource()[[1]]
read_html(htmltxt) %>% html_table()

[[1]]
# A tibble: 51 x 6
    Rank Brand           `Brand Value` `1-Yr Value Change` `Brand Revenue` Industry    
   <int> <chr>           <chr>         <chr>               <chr>           <chr>       
 1    NA ""              ""            ""                  ""              ""          
 2     1 "Apple"         "$241.2 B"    "17%"               "$260.2 B"      "Technology"
 3     2 "Google"        "$207.5 B"    "24%"               "$145.6 B"      "Technology"
 4     3 "Microsoft"     "$162.9 B"    "30%"               "$125.8 B"      "Technology"
 5     4 "Amazon"        "$135.4 B"    "40%"               "$260.5 B"      "Technology"
 6     5 "Facebook"      "$70.3 B"     "-21%"              "$49.7 B"       "Technology"
 7     6 "Coca-Cola"     "$64.4 B"     "9%"                "$25.2 B"       "Beverages" 
 8     7 "Disney"        "$61.3 B"     "18%"               "$38.7 B"       "Leisure"   
 9     8 "Samsung"       "$50.4 B"     "-5%"               "$209.5 B"      "Technology"
10     9 "Louis Vuitton" "$47.2 B"     "20%"               "$15 B"         "Luxury"    
# ... with 41 more rows
# i Use `print(n = ...)` to see more rows
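
Since the question also asks for a data frame to work with, here is a minimal sketch of one way to clean up the first element of that list using base R only. The small `brands` data frame below is a stand-in built from the first rows of the output above, and `to_number` is a hypothetical helper, not part of rvest. It assumes every value is quoted in billions ("B"), as in the rows shown; other suffixes would need extra handling.

```r
# Stand-in for as.data.frame(html_table(...)[[1]]); the values are copied
# from the first rows of the output shown above.
brands <- data.frame(
  Rank = c(NA, 1L, 2L, 3L),
  Brand = c("", "Apple", "Google", "Microsoft"),
  `Brand Value` = c("", "$241.2 B", "$207.5 B", "$162.9 B"),
  `1-Yr Value Change` = c("", "17%", "24%", "30%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop the blank spacer row at the top of the table (its Rank is NA).
brands <- brands[!is.na(brands$Rank), ]

# Hypothetical helper: remove "$", "%", "B" and spaces, then convert to numeric.
# Assumes all amounts are in billions, as in the rows above.
to_number <- function(x) as.numeric(gsub("[$%B ]", "", x))

brands$brand_value_bn   <- to_number(brands[["Brand Value"]])
brands$value_change_pct <- to_number(brands[["1-Yr Value Change"]])

str(brands)
```

With the real result, the stand-in would be replaced by something like as.data.frame(tables[[1]]), where tables is a name of your choosing for the value returned by html_table().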

Upvotes: 0
