Tdebeus

Reputation: 1599

How do I scrape selected list-items from a webpage?

I'm trying to scrape Marvel movies and their characters (featured, supporting, antagonists, other) from marvel.wikia.com. The characters live in lists in the DOM, and I can't find the right selector for html_nodes() to get all the list items under each character type.

The following code extracts every linked list item on the page, while I want only the ones belonging to the featured, supporting, antagonist and other characters (the last group does not apply to X2).

library(rvest)
library(tidyverse)

test_url <- "http://marvel.wikia.com/wiki/X2_(film)"

read_html(test_url) %>%
  html_nodes("li > a") %>%
  html_text() 

Desired outcome:

# A tibble: 16 x 3
   movie type                  character                  
   <chr> <chr>                 <chr>                      
 1 X2    Featured Characters   Professor Charles Xavier   
 2 X2    Featured Characters   Wolverine (Logan)          
 3 X2    Featured Characters   Storm (Ororo Munroe)       
 4 X2    Featured Characters   Dr. Jean Grey              
 5 X2    Featured Characters   Cyclops (Scott Summers)    
 6 X2    Featured Characters   Rogue (Marie)              
 7 X2    Featured Characters   Iceman (Bobby Drake)       
 8 X2    Supporting Characters Nightcrawler (Kurt Wagner) 
 9 X2    Supporting Characters Pyro (John Allerdyce)      
10 X2    Supporting Characters Mystique (Raven Darkholme) 
11 X2    Supporting Characters Magneto (Erik Lehnsherr)   
12 X2    Antagonists           Col. William Stryker       
13 X2    Antagonists           Sgt. Lyman                 
14 X2    Antagonists           Unnamed Soldiers           
15 X2    Antagonists           Deathstrike (Yuriko Oyama) 
16 X2    Antagonists           Mutant 143 (Jason Stryker)

Upvotes: 1

Views: 149

Answers (1)

Prem

Reputation: 11955

You could start with something like this:

library(rvest)
library(tidyverse)

test_url <- "http://marvel.wikia.com/wiki/X2_(film)"

#scrape data
url_data <- read_html(test_url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/ul') %>%
  html_text()

# format the scraped data into the desired shape
# (one row per list first, then one row per character)
df <- data.frame(movie = gsub(".*/", "", test_url),
                 type = c("Featured Characters", "Supporting Characters", "Antagonists", "Other Characters"),
                 characters = url_data[1:4],
                 stringsAsFactors = FALSE) %>%
  separate_rows(characters, sep = "\\n")

which gives

> head(df)
      movie                type                         characters
1 X2_(film) Featured Characters                             X-Men 
2 X2_(film) Featured Characters          Professor Charles Xavier 
3 X2_(film) Featured Characters                 Wolverine (Logan) 
4 X2_(film) Featured Characters              Storm (Ororo Munroe) 
5 X2_(film) Featured Characters   Dr. Jean Grey   (Apparent death)
6 X2_(film) Featured Characters           Cyclops (Scott Summers) 
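
If you would rather not hard-code the type labels and list positions, one option is to select the character links first and then look up, for each link, the label that sits just before its list. Below is a minimal sketch on an inline HTML snippet rather than the live page. It assumes each type label is a `<p>` placed directly before its `<ul>`, which I have not verified against the wiki's actual markup; `sample_html` and its contents are made up for illustration.

```r
library(rvest)
library(tibble)

# Hypothetical markup standing in for the wiki page (an assumption, not the
# real page source): each character type is a <p> label followed by its <ul>.
sample_html <- '
<div id="mw-content-text">
  <p><b>Featured Characters:</b></p>
  <ul>
    <li><a href="/wiki/A">Professor Charles Xavier</a></li>
    <li><a href="/wiki/B">Wolverine (Logan)</a></li>
  </ul>
  <p><b>Antagonists:</b></p>
  <ul>
    <li><a href="/wiki/C">Col. William Stryker</a></li>
  </ul>
</div>'

page <- read_html(sample_html)

# First link of every top-level list item inside the content container.
links <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/ul/li/a[1]')

# For each link: go up to its <ul>, then take the closest preceding <p>.
labels <- links %>%
  html_node(xpath = "../../preceding-sibling::p[1]") %>%
  html_text(trim = TRUE)

characters <- tibble(
  movie     = "X2",
  type      = sub(":$", "", labels),        # drop the trailing colon
  character = html_text(links, trim = TRUE)
)

print(characters)
```

On the real page you would swap `read_html(sample_html)` for `read_html(test_url)` and adjust the two XPath expressions to whatever the headings and lists actually look like. Nested lists (such as team members listed under a group name) are not handled by this sketch.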

Upvotes: 2
