user2205916

Reputation: 3456

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.

I am ultimately interested in building a data frame that has the following schema/columns: rank, blog_name, facebook_fans, twitter_followers, alexa_rank.

I was able to use Selector Gadget to correctly identify the HTML tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NA (the console prints "...using first", then "NAs introduced by coercion", then "[1] NA"). My code is below:

data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()

I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only one number is returned.

714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first
{xml_node}
<td>
[1] <span>997,669</span>

Upvotes: 1

Views: 4762

Answers (3)

alistaire

Reputation: 43334

You can use html_table to extract the whole table with minimal work:

library(rvest)
library(tidyverse)

# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()

game_blogs <- h %>% 
    html_node('table') %>%    # select enclosing table node
    html_table() %>%    # turn table into data.frame
    set_names(make.names) %>%    # make names syntactic
    mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>%    # extract title from name info
    mutate_at(3:5, parse_number) %>%    # make numbers actually numbers
    as_tibble()    # for printing (tbl_df() is deprecated in current dplyr)

game_blogs
#> # A tibble: 119 x 5
#>     Rank                  Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#>    <int>                      <chr>         <dbl>             <dbl>      <dbl>
#>  1     1 Kotaku - The Gamer's Guide        997669           1209029        873
#>  2     2          IGN | Video Games       4070476           4493805        399
#>  3     3                  Xbox Wire      23141452          10210993        879
#>  4     4  Official PlayStation Blog      38019811          12059607        500
#>  5     5              Nintendo Life         35977             95044      17727
#>  6     6              Game Informer        603681           1770812      10057
#>  7     7            Reddit | Gamers       1003705            430017         25
#>  8     8                    Polygon        623808            485827       1594
#>  9     9   Xbox Live's Major Nelson         65905            993481      23114
#> 10    10                      VG247        397798            202084       3960
#> # ... with 109 more rows

It's worth checking that everything is parsed like you want, but it should be usable at this point.
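
One way to run that check without hitting the live site is to try the same steps on a small inline fragment. The sketch below is only illustrative: the HTML is a made-up stand-in for the page (its real markup may differ), and it assumes the readr package is installed for parse_number.

```r
library(rvest)
library(readr)

# Hypothetical fragment standing in for the real page's table.
page <- read_html('<table>
  <tr><th>Rank</th><th>Blog Name</th><th>Facebook Fans</th></tr>
  <tr><td>1</td><td>Kotaku</td><td>997,669</td></tr>
  <tr><td>2</td><td>IGN</td><td>4,070,476</td></tr>
</table>')

# html_table() reads the whole table; the comma-formatted column stays character.
tbl <- page %>% html_node("table") %>% html_table()
names(tbl) <- make.names(names(tbl))   # "Facebook Fans" -> "Facebook.Fans"

# parse_number() drops the thousands separators and returns doubles.
tbl$Facebook.Fans <- parse_number(tbl$Facebook.Fans)

str(tbl)   # inspect the column types before relying on them
```

Looking at the column types with str() is a quick way to confirm that the numeric columns really are numeric.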

Upvotes: 2

Andrew Lavers

Reputation: 4378

This uses html_nodes (plural) and str_replace to remove commas in numbers. Not sure if these are all the stats you need.

library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()
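
To see why the original code returned a single NA, here is a minimal sketch on an inline fragment. The HTML is a made-up stand-in rather than the page's actual markup, and base gsub is used in place of str_replace_all only to keep the example free of extra dependencies.

```r
library(rvest)

# Hypothetical fragment: two rows of comma-formatted stats cells.
page <- read_html('<table>
  <tr><td class="stats">997,669</td><td class="stats">1,209,029</td><td class="stats">873</td></tr>
  <tr><td class="stats">4,070,476</td><td class="stats">4,493,805</td><td class="stats">399</td></tr>
</table>')

# html_node() (singular) keeps only the first match, and the comma in
# "997,669" makes as.numeric() fail with "NAs introduced by coercion".
first_only <- page %>% html_node(".stats") %>% html_text()
first_num <- suppressWarnings(as.numeric(first_only))

# html_nodes() (plural) returns every match; strip commas before converting.
all_stats <- page %>% html_nodes(".stats") %>% html_text()
stats_num <- as.numeric(gsub(",", "", all_stats, fixed = TRUE))

first_num
stats_num
```

So two separate problems were stacked in the question: the singular selector function and the thousands separators.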

Upvotes: 0

R. Schifini

Reputation: 9313

This may help you:

library(rvest)

d1 <- read_html("http://blog.feedspot.com/video_game_news/")

stats <- d1 %>%
    html_nodes(".stats") %>%
    html_text()

blogname <- d1%>%
    html_nodes(".tlink") %>%
    html_text()

Note that it is html_nodes (plural), which returns every matching node rather than only the first.

Result:

> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games"          "Xbox Wire"                  "Official PlayStation Blog" 
[5] "Nintendo Life "             "Game Informer" 

> head(stats,12)
 [1] "997,669"    "1,209,029"  "873"        "4,070,476"  "4,493,805"  "399"        "23,141,452" "10,210,993" "879"       
[10] "38,019,811" "12,059,607" "500"

blogname returns a vector of blog names that is easy to manage. The stats output, on the other hand, comes out mixed, because the Facebook fans, Twitter followers, and Alexa rank cells all share the same stats class and cannot be told apart by selector. The output vector therefore repeats every three values, that is, stats = c(fb, tw, alx, fb, tw, alx, ...). You should split it into one vector per column:

FBstats = stats[seq(1,length(stats),3)]

> head(stats[seq(1,length(stats),3)])
[1] "997,669"    "4,070,476"  "23,141,452" "38,019,811" "35,977"     "603,681"   
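
As an alternative to three separate seq() calls, the interleaved vector can be reshaped in one step. This is a sketch that assumes every blog contributes exactly three stats cells. If any row is missing a value, the columns will drift out of alignment. The sample values are copied from the head(stats, 12) output above, and the column names follow the schema in the question.

```r
# First 12 values from the scraped stats vector shown above.
stats <- c("997,669", "1,209,029", "873", "4,070,476", "4,493,805", "399",
           "23,141,452", "10,210,993", "879", "38,019,811", "12,059,607", "500")

# Remove thousands separators so the strings convert cleanly to numbers.
stats_num <- as.numeric(gsub(",", "", stats, fixed = TRUE))

# Fill a 3-column matrix row by row: each row is one blog's
# (Facebook, Twitter, Alexa) triple.
stats_df <- as.data.frame(matrix(stats_num, ncol = 3, byrow = TRUE))
names(stats_df) <- c("facebook_fans", "twitter_followers", "alexa_rank")

stats_df
```

The resulting data frame can then be combined with blogname using cbind or data.frame, provided both have the same number of rows.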

Upvotes: 2
