C. Martin

Reputation: 9

Web-scraping error in R

I'm learning how to do web-scraping in R and thought I'd try things out on a page with a built-in table. My ultimate goal is to have a dataframe with four variables (Name, Party, Constituency, Link to individual webpage).

library(rvest)
library(XML)

url <- "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0"

constituency <- read_html(url)
print(constituency)

constituency_red <- constituency %>% html_nodes('td') %>% html_text()
constituency_red <- paste0(url, constituency_red)
constituency_red <- unique(constituency_red)
constituency_red

The output I get after these steps looks like I'm going in the right direction. However, as you can see when scrolling to the right, it's still a bit of a mess. Any ideas on what I can do to clean this up?

[974] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n                                Poulter, Dr\r\n                                (Conservative)\r\n                            "                               
[975] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Central Suffolk and North Ipswich"                                                                                                                               
[976] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n                                Pound, Stephen\r\n                                (Labour)\r\n                            "                                  
[977] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Ealing North"                                                                                                                                                    

After this I tried a second approach. The following code appears to give me a clean list of all the hyperlinks, so I'm wondering if this might be a potential workaround:

constituency_links <- constituency %>% html_nodes("tr") %>% html_nodes('td') %>% html_nodes("a") %>% html_attr("href")
constituency_links <- paste0(url, constituency_links)
constituency_links <- unique(constituency_links)
constituency_links

My third and final try was to use the following code:

all_constituency <- lapply(constituency_links, function(x) read_html(x))
all_constituency

When I run this, things slow down a lot and then I start getting Error in open.connection(x, "rb") : HTTP error 400. So I tried running it as a loop instead:

all_constituency <- list()
for (i in constituency_links) {
  all_constituency[[i]] <- read_html(i)
}

I get the same error messages with this approach. Any suggestions on how to pull and clean this information would be much appreciated.
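In case it helps, one pattern I've seen suggested for this kind of failure is to wrap each request in tryCatch and pause between requests, so a single bad URL (or the server throttling rapid-fire requests) doesn't abort the whole run. This is only a sketch; safe_read and fetch_all are made-up helper names, and I haven't confirmed that throttling is actually what causes the 400s here:

```r
library(rvest)

# Read a page, returning NULL (instead of stopping) on any error
safe_read <- function(u) {
  tryCatch(read_html(u), error = function(e) {
    message("Failed: ", u, " (", conditionMessage(e), ")")
    NULL
  })
}

# Fetch each link with a pause between requests; drop the failures
fetch_all <- function(links, pause = 1) {
  pages <- list()
  for (u in links) {
    pages[[u]] <- safe_read(u)
    Sys.sleep(pause)  # be polite to the server between requests
  }
  Filter(Negate(is.null), pages)
}

# usage: all_constituency <- fetch_all(constituency_links)
```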

Upvotes: 0

Views: 1435

Answers (2)

Weihuang Wong

Reputation: 13128

We can start by obtaining text strings with MP names, party, and constituency:

text <- constituency %>% html_nodes('table') %>% html_nodes('tr') %>% html_text()
head(text, 3)
# [1] "Surname, First name\r\n                            Constituency\r\n\t\r\n                        "                                                                              
# [2] "A\r\n                            back to top\r\n                        "                                                                                                       
# [3] "\r\n                                Abbott, Ms Diane\r\n                                (Labour)\r\n                            \r\n\t\tHackney North and Stoke Newington\r\n\t"

We can iterate through text, parse each element and split up the string into the fields we want (name, party, constituency):

dd <- lapply(text, function(x) {
  out <- unlist(strsplit(x, "\r\n"))[c(2, 3, 5)]                # Use "\r\n" to split the strings
  as.vector(sapply(out, function(x) sub("(\\t)+|\\s+", "", x))) # Remove spaces and the "\t"
})
head(dd, 3)
# [[1]]
# [1] "Constituency" ""             NA            

# [[2]]
# [1] "back to top" ""            NA           

# [[3]]
# [1] "Abbott, Ms Diane"                  "(Labour)"                         
# [3] "Hackney North and Stoke Newington"

Now, make dd into a dataframe, filtering out irrelevant rows (where party is blank):

df <- data.frame(matrix(unlist(dd), ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
names(df) <- c("name", "party", "con")
df$party <- sub("\\((.*)\\)", "\\1", df$party)   # Remove parentheses
df <- df[df$party != "", ]                       # Remove rows where party is blank
head(df, 3)
#               name        party                               con
# 3 Abbott, Ms Diane       Labour Hackney North and Stoke Newington
# 4 Abrahams, Debbie       Labour       Oldham East and Saddleworth
# 5     Adams, Nigel Conservative                  Selby and Ainsty

We can now deal with the links. When we inspect the links, those that are relevant to the MPs have the word "biographies" in them, so we use that to filter the list:

links <- constituency %>% html_nodes("a") %>% html_attr("href")
links <- links[grepl("biographies", links)]
head(links, 3)
# [1] "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172" 
# [2] "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212"
# [3] "http://www.parliament.uk/biographies/commons/nigel-adams/4057"    

And complete our dataframe by adding the links:

df$links <- links
str(head(df, 3))
# 'data.frame': 3 obs. of  4 variables:
#  $ name : chr  "Abbott, Ms Diane" "Abrahams, Debbie" "Adams, Nigel"
#  $ party: chr  "Labour" "Labour" "Conservative"
#  $ con  : chr  "Hackney North and Stoke Newington" "Oldham East and Saddleworth" "Selby and Ainsty"
#  $ links: chr  "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172" "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212" "http://www.parliament.uk/biographies/commons/nigel-adams/4057"
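One caveat: df$links <- links relies on links having exactly one entry per row of df, in the same (alphabetical) order. A cheap guard is to assert that the lengths match before assigning; here is a sketch with toy stand-ins for the real df and links:

```r
# Toy stand-ins for the df and links built above
df <- data.frame(name = c("Abbott, Ms Diane", "Abrahams, Debbie"),
                 stringsAsFactors = FALSE)
links <- c("http://www.parliament.uk/biographies/commons/ms-diane-abbott/172",
           "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212")

# Stop early if the scraped links drifted out of sync with the table rows
stopifnot(length(links) == nrow(df))
df$links <- links
```

If the two ever disagree in length, stopifnot fails loudly instead of R silently recycling the shorter vector.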

Upvotes: 1

hrbrmstr

Reputation: 78832

It's pretty straightforward:

library(rvest)
library(stringi)
library(purrr)
library(dplyr)

pg <- read_html("http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0")
td_1 <- html_nodes(pg, xpath=".//td[contains(@id,'ctl00_ctl00_FormContent_SiteSpecificPlaceholder_PageContent_rptMembers_ctl')]")

data_frame(mp_name=html_text(html_nodes(td_1, "a")),
           href=html_attr(html_nodes(td_1, "a"), "href"),
           party=map_chr(stri_match_all_regex(html_text(td_1), "\\((.*)\\)"), 2),
           constituency=html_text(html_nodes(pg, xpath=".//tr/td[2]"))) -> df

glimpse(df)
## Observations: 649
## Variables: 4
## $ mp_name      <chr> "Abbott, Ms Diane", "Abrahams, Debbie", "Adams, N...
## $ href         <chr> "http://www.parliament.uk/biographies/commons/ms-...
## $ party        <chr> "Labour", "Labour", "Conservative", "Conservative...
## $ constituency <chr> "Hackney North and Stoke Newington", "Oldham East...

Upvotes: 2
