Reputation: 667
I am trying to retrieve data for the most powerful brands from http://www.forbes.com/powerful-brands/list/#tab:rank. When my initial attempt with `getURL`
and `htmlParse` failed, I realized that the table data is loaded from some other link.
So to make things easy I downloaded the HTML page and tried to retrieve the data from it.
I initially tried using
library(XML)
library(RCurl)
library(ggplot2)
forbes <- readHTMLTable("forbes.html",header = TRUE,as.data.frame = TRUE)
forbes
Now when I display `forbes` I get a list. I had thought I would get a data frame instead.
I checked the list and found data for the top 10 brands in `forbes$the_list`,
but not the data for the rest of the companies, i.e. beyond the top 10.
How can I retrieve all the tabular data from the Forbes page, and how can I convert it to a data frame for further manipulation?
Please let me know if you need any further info.
Upvotes: 1
Views: 106
Reputation: 2223
I was able to extract the whole table with the following code:
library(RSelenium)
library(rvest)  # provides read_html(), html_table() and %>%
url <- "http://www.forbes.com/powerful-brands/list/#tab:rank"
# Start a Selenium server in Docker (shell() is Windows-only; use system() on other platforms)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
# Scroll through the page step by step so that the lazily loaded rows are rendered
for (i in 1:30)
{
  print(i)
  for (j in 1:30)
  {
    js_Script <- paste0("scroll(", (i - 1) * 300, ",", (j - 1) * 300, ");")
    remDr$executeScript(js_Script)
  }
}
Sys.sleep(5)
htmltxt <- remDr$getPageSource()[[1]]
read_html(htmltxt) %>% html_table()
[[1]]
# A tibble: 51 x 6
Rank Brand `Brand Value` `1-Yr Value Change` `Brand Revenue` Industry
<int> <chr> <chr> <chr> <chr> <chr>
1 NA "" "" "" "" ""
2 1 "Apple" "$241.2 B" "17%" "$260.2 B" "Technology"
3 2 "Google" "$207.5 B" "24%" "$145.6 B" "Technology"
4 3 "Microsoft" "$162.9 B" "30%" "$125.8 B" "Technology"
5 4 "Amazon" "$135.4 B" "40%" "$260.5 B" "Technology"
6 5 "Facebook" "$70.3 B" "-21%" "$49.7 B" "Technology"
7 6 "Coca-Cola" "$64.4 B" "9%" "$25.2 B" "Beverages"
8 7 "Disney" "$61.3 B" "18%" "$38.7 B" "Leisure"
9 8 "Samsung" "$50.4 B" "-5%" "$209.5 B" "Technology"
10 9 "Louis Vuitton" "$47.2 B" "20%" "$15 B" "Luxury"
# ... with 41 more rows
# i Use `print(n = ...)` to see more rows
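Since the question asks for a data frame that can be manipulated, here is a minimal sketch of a possible cleanup step. It is not part of the code above: the sample data frame only mimics the first rows of the printed output, `to_number` is a hypothetical helper, and the column names `value_bn` and `change_pct` are made up for illustration. It assumes every brand value is given in billions ("B"), as in the rows shown. In a real session the input would be the first element of the list returned by `html_table()`.

```r
# Hypothetical sample mirroring the first rows of the scraped table,
# including the blank placeholder row that appears first.
brands <- data.frame(
  Rank = c(NA, 1L, 2L, 3L),
  Brand = c("", "Apple", "Google", "Microsoft"),
  `Brand Value` = c("", "$241.2 B", "$207.5 B", "$162.9 B"),
  `1-Yr Value Change` = c("", "17%", "24%", "30%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop the empty row (its Rank is NA).
brands <- brands[!is.na(brands$Rank), ]

# Hypothetical helper: strip "$", "B", "%" and whitespace, then convert to numeric.
to_number <- function(x) as.numeric(gsub("[$%B[:space:]]", "", x))

# Assumed column names for the cleaned values (billions of dollars, percent).
brands$value_bn <- to_number(brands[["Brand Value"]])
brands$change_pct <- to_number(brands[["1-Yr Value Change"]])

str(brands)
```

After this step the numeric columns can be sorted, summarised or plotted directly, while the original character columns are kept for reference.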
Upvotes: 0