Manupati punith

Reputation: 11

Web scraping using rvest in R to get insights from a web page

I am having difficulty loading the rvest/XML packages into R, and I am unable to run the code.

How exactly should I use rvest for web scraping? How can I read a table from the web page "https://www.forbes.com/powerful-brands/list/"?

library(rvest)
forbs <- readHTMLTable("https://www.forbes.com/powerful-brands/list/")
head(forbs)
View(forbs)

It shows an error like this:

forbs1 <- html_text("#list_table")
Error in UseMethod("xml_text") : no applicable method for 'xml_text' applied to an object of class "character"

Upvotes: 1

Views: 125

Answers (1)

Emmanuel Hamel

Reputation: 2223

Here is one approach that can be considered, using rvest alone:

library(rvest)
url <- "https://www.forbes.com/the-worlds-most-valuable-brands/#4052ca71119c"
# Parse the page at `url` and extract every <table> as a list of tibbles.
read_html(url) %>% html_table()

[[1]]
# A tibble: 51 x 6
    Rank Brand           `Brand Value` `1-Yr Value Change` `Brand Revenue` Industry    
   <int> <chr>           <chr>         <chr>               <chr>           <chr>       
 1    NA ""              ""            ""                  ""              ""          
 2     1 "Apple"         "$241.2 B"    "17%"               "$260.2 B"      "Technology"
 3     2 "Google"        "$207.5 B"    "24%"               "$145.6 B"      "Technology"
 4     3 "Microsoft"     "$162.9 B"    "30%"               "$125.8 B"      "Technology"
 5     4 "Amazon"        "$135.4 B"    "40%"               "$260.5 B"      "Technology"
 6     5 "Facebook"      "$70.3 B"     "-21%"              "$49.7 B"       "Technology"
 7     6 "Coca-Cola"     "$64.4 B"     "9%"                "$25.2 B"       "Beverages" 
 8     7 "Disney"        "$61.3 B"     "18%"               "$38.7 B"       "Leisure"   
 9     8 "Samsung"       "$50.4 B"     "-5%"               "$209.5 B"      "Technology"
10     9 "Louis Vuitton" "$47.2 B"     "20%"               "$15 B"         "Luxury"    
# ... with 41 more rows
# i Use `print(n = ...)` to see more rows
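
The error in the question arises because html_text() was given the selector string "#list_table" itself rather than a parsed node. A minimal sketch of the parse, select, extract order follows. It uses a small inline HTML snippet made up for illustration so that it runs without network access; only the id "list_table" comes from the question, and the real page may be structured differently.

```r
library(rvest)

# Hypothetical stand-in for the page; only the id "list_table" is from the question.
page <- read_html('
  <table id="list_table">
    <tr><th>Rank</th><th>Brand</th></tr>
    <tr><td>1</td><td>Apple</td></tr>
    <tr><td>2</td><td>Google</td></tr>
  </table>')

# html_text("#list_table") fails because "#list_table" is a character string.
# Parse the document first, then select the node, then extract its contents.
node <- html_element(page, "#list_table")
brands <- html_table(node)
print(brands)
```

html_element() assumes rvest 1.0.0 or later; on older versions html_node() plays the same role.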

Here is another approach, which drives Internet Explorer through RDCOMClient (Windows only) so that the page is rendered before it is parsed:

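library(RDCOMClient)
library(rvest)
url <- "https://www.forbes.com/the-worlds-most-valuable-brands/#4052ca71119c"
# Start Internet Explorer through COM and load the page.
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
# Give the page time to load and render its table.
Sys.sleep(15)
# Take the rendered HTML from the browser and parse its tables.
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerHtml()
read_html(html_Content) %>% html_table()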

[[1]]
# A tibble: 51 x 6
    Rank Brand           `Brand Value` `1-Yr Value Change` `Brand Revenue` Industry    
   <int> <chr>           <chr>         <chr>               <chr>           <chr>       
 1    NA ""              ""            ""                  ""              ""          
 2     1 "Apple"         "$241.2 B"    "17%"               "$260.2 B"      "Technology"
 3     2 "Google"        "$207.5 B"    "24%"               "$145.6 B"      "Technology"
 4     3 "Microsoft"     "$162.9 B"    "30%"               "$125.8 B"      "Technology"
 5     4 "Amazon"        "$135.4 B"    "40%"               "$260.5 B"      "Technology"
 6     5 "Facebook"      "$70.3 B"     "-21%"              "$49.7 B"       "Technology"
 7     6 "Coca-Cola"     "$64.4 B"     "9%"                "$25.2 B"       "Beverages" 
 8     7 "Disney"        "$61.3 B"     "18%"               "$38.7 B"       "Leisure"   
 9     8 "Samsung"       "$50.4 B"     "-5%"               "$209.5 B"      "Technology"
10     9 "Louis Vuitton" "$47.2 B"     "20%"               "$15 B"         "Luxury"    
# ... with 41 more rows
# i Use `print(n = ...)` to see more rows

Here is another approach, which uses RSelenium with a Firefox browser running in a Docker container:

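library(RSelenium)
library(rvest)
url <- "https://www.forbes.com/the-worlds-most-valuable-brands/#4052ca71119c"
# Start a Selenium server with Firefox in Docker (host port 4445 -> container port 4444).
# shell() is Windows-only; use system() on other platforms.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
# Give the page time to load and render its table.
Sys.sleep(15)
remDr$getPageSource()[[1]] %>% read_html() %>% html_table()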

[[1]]
# A tibble: 51 x 6
    Rank Brand           `Brand Value` `1-Yr Value Change` `Brand Revenue` Industry    
   <int> <chr>           <chr>         <chr>               <chr>           <chr>       
 1    NA ""              ""            ""                  ""              ""          
 2     1 "Apple"         "$241.2 B"    "17%"               "$260.2 B"      "Technology"
 3     2 "Google"        "$207.5 B"    "24%"               "$145.6 B"      "Technology"
 4     3 "Microsoft"     "$162.9 B"    "30%"               "$125.8 B"      "Technology"
 5     4 "Amazon"        "$135.4 B"    "40%"               "$260.5 B"      "Technology"
 6     5 "Facebook"      "$70.3 B"     "-21%"              "$49.7 B"       "Technology"
 7     6 "Coca-Cola"     "$64.4 B"     "9%"                "$25.2 B"       "Beverages" 
 8     7 "Disney"        "$61.3 B"     "18%"               "$38.7 B"       "Leisure"   
 9     8 "Samsung"       "$50.4 B"     "-5%"               "$209.5 B"      "Technology"
10     9 "Louis Vuitton" "$47.2 B"     "20%"               "$15 B"         "Luxury"    
# ... with 41 more rows
# i Use `print(n = ...)` to see more rows
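
In the output above, the first row is empty and the value columns come back as text. A minimal cleanup sketch in base R could look like the following. It works on a small sample copied by hand from the rows shown above, and the helper name parse_billions is hypothetical.

```r
# Sample copied by hand from the first rows of the output above.
brands <- data.frame(
  Rank = c(NA, 1L, 2L),
  Brand = c("", "Apple", "Google"),
  `Brand Value` = c("", "$241.2 B", "$207.5 B"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip everything except digits and the decimal point,
# so "$241.2 B" becomes 241.2 (billions of dollars).
parse_billions <- function(x) as.numeric(gsub("[^0-9.]", "", x))

# Drop the empty placeholder row, then convert the value column to numeric.
brands <- brands[!is.na(brands$Rank), ]
brands$`Brand Value` <- parse_billions(brands$`Brand Value`)
print(brands)
```

The same steps should apply to the first element of the list returned by html_table(), assuming the column names match those shown above.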

Upvotes: 0
