C. Martin

Reputation: 9

Web-scraping error in R

I'm learning how to do web-scraping in R and thought I'd try things out on a page with a built-in table. My ultimate goal is to have a dataframe with four variables (Name, Party, Constituency, Link to individual webpage).

library(rvest)
library(XML)

url <- "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0"

constituency <- read_html(url)
print(constituency)

constituency_red <- constituency %>% html_nodes('td') %>% html_text()
constituency_red <- paste0(url, constituency_red)
constituency_red <- unique(constituency_red)
constituency_red

The output I get after these steps looks like I'm going in the right direction. However, as you can see when scrolling to the right, it's still a bit of a mess. Any ideas on what I can do to clean this up?

[974] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n                                Poulter, Dr\r\n                                (Conservative)\r\n                            "                               
[975] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Central Suffolk and North Ipswich"                                                                                                                               
[976] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n                                Pound, Stephen\r\n                                (Labour)\r\n                            "                                  
[977] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Ealing North"                                                                                                                                                    

After this I tried a second approach. The following code appears to give me a clean list of all the hyperlinks, so I'm wondering if this might be a potential workaround:

constituency_links <- constituency %>% html_nodes("tr") %>% html_nodes('td') %>% html_nodes("a") %>% html_attr("href")
constituency_links <- paste0(url, constituency_links)
constituency_links <- unique(constituency_links)
constituency_links

My third and final try was to use the following code:

all_constituency <- lapply(constituency_links, function(x) read_html(x))
all_constituency

When I run this, things slow down a lot and then I start getting Error in open.connection(x, "rb") : HTTP error 400. So I tried running it as a loop instead:

all_constituency <- list()
for (i in constituency_links) {
  all_constituency[[i]] <- read_html(i)
}

I get the same error messages with this approach. Any suggestions on how to pull and clean this information would be much appreciated.
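In case it helps, one pattern I've seen suggested for this kind of failure is to wrap each request in tryCatch and pause between requests, so a single bad URL (or the server throttling rapid-fire requests) doesn't abort the whole run. This is only a sketch; safe_read and fetch_all are made-up helper names, and I haven't confirmed that throttling is actually what causes the 400s here:

```r
library(rvest)

# Read a page, returning NULL (instead of stopping) on any error
safe_read <- function(u) {
  tryCatch(read_html(u), error = function(e) {
    message("Failed: ", u, " (", conditionMessage(e), ")")
    NULL
  })
}

# Fetch each link with a pause between requests; drop the failures
fetch_all <- function(links, pause = 1) {
  pages <- list()
  for (u in links) {
    pages[[u]] <- safe_read(u)
    Sys.sleep(pause)  # be polite to the server between requests
  }
  Filter(Negate(is.null), pages)
}

# usage: all_constituency <- fetch_all(constituency_links)
```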

Upvotes: 0

Views: 1435

Answers (2)

Weihuang Wong

Reputation: 13128

We can start by obtaining text strings with MP names, party, and constituency:

text <- constituency %>% html_nodes('table') %>% html_nodes('tr') %>% html_text()
head(text, 3)
# [1] "Surname, First name\r\n                            Constituency\r\n\t\r\n                        "                                                                              
# [2] "A\r\n                            back to top\r\n                        "                                                                                                       
# [3] "\r\n                                Abbott, Ms Diane\r\n                                (Labour)\r\n                            \r\n\t\tHackney North and Stoke Newington\r\n\t"

We can iterate through text, parse each element and split up the string into the fields we want (name, party, constituency):

dd <- lapply(text, function(x) {
  out <- unlist(strsplit(x, "\r\n"))[c(2, 3, 5)]                # Use "\r\n" to split the strings
  as.vector(sapply(out, function(x) sub("(\\t)+|\\s+", "", x))) # Remove spaces and the "\t"
})
head(dd, 3)
# [[1]]
# [1] "Constituency" ""             NA            

# [[2]]
# [1] "back to top" ""            NA           

# [[3]]
# [1] "Abbott, Ms Diane"                  "(Labour)"                         
# [3] "Hackney North and Stoke Newington"

Now, make dd into a dataframe, filtering out irrelevant rows (where party is blank):

df <- data.frame(matrix(unlist(dd), ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
names(df) <- c("name", "party", "con")
df$party <- sub("\\((.*)\\)", "\\1", df$party)   # Remove parentheses
df <- df[df$party != "", ]                       # Remove rows where party is blank
head(df, 3)
#               name        party                               con
# 3 Abbott, Ms Diane       Labour Hackney North and Stoke Newington
# 4 Abrahams, Debbie       Labour       Oldham East and Saddleworth
# 5     Adams, Nigel Conservative                  Selby and Ainsty

We can now deal with the links. When we inspect the links, those that are relevant to the MPs have the word "biographies" in them, so we use that to filter the list:

links <- constituency %>% html_nodes("a") %>% html_attr("href")
links <- links[grepl("biographies", links)]
head(links, 3)
# [1] "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172" 
# [2] "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212"
# [3] "http://www.parliament.uk/biographies/commons/nigel-adams/4057"    

And complete our dataframe by adding the links:

df$links <- links
str(head(df, 3))
# 'data.frame': 3 obs. of  4 variables:
#  $ name : chr  "Abbott, Ms Diane" "Abrahams, Debbie" "Adams, Nigel"
#  $ party: chr  "Labour" "Labour" "Conservative"
#  $ con  : chr  "Hackney North and Stoke Newington" "Oldham East and Saddleworth" "Selby and Ainsty"
#  $ links: chr  "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172" "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212" "http://www.parliament.uk/biographies/commons/nigel-adams/4057"
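One caveat: df$links <- links relies on links having exactly one entry per row of df, in the same (alphabetical) order. A cheap guard is to assert that the lengths match before assigning; here is a sketch with toy stand-ins for the real df and links:

```r
# Toy stand-ins for the df and links built above
df <- data.frame(name = c("Abbott, Ms Diane", "Abrahams, Debbie"),
                 stringsAsFactors = FALSE)
links <- c("http://www.parliament.uk/biographies/commons/ms-diane-abbott/172",
           "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212")

# Stop early if the scraped links drifted out of sync with the table rows
stopifnot(length(links) == nrow(df))
df$links <- links
```

If the two ever disagree in length, stopifnot fails loudly instead of R silently recycling the shorter vector.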

Upvotes: 1

hrbrmstr

Reputation: 78832

It's pretty straightforward:

library(rvest)
library(stringi)
library(purrr)
library(dplyr)

pg <- read_html("http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0")
td_1 <- html_nodes(pg, xpath=".//td[contains(@id,'ctl00_ctl00_FormContent_SiteSpecificPlaceholder_PageContent_rptMembers_ctl')]")

data_frame(mp_name=html_text(html_nodes(td_1, "a")),
           href=html_attr(html_nodes(td_1, "a"), "href"),
           party=map_chr(stri_match_all_regex(html_text(td_1), "\\((.*)\\)"), 2),
           constituency=html_text(html_nodes(pg, xpath=".//tr/td[2]"))) -> df

glimpse(df)
## Observations: 649
## Variables: 4
## $ mp_name      <chr> "Abbott, Ms Diane", "Abrahams, Debbie", "Adams, N...
## $ href         <chr> "http://www.parliament.uk/biographies/commons/ms-...
## $ party        <chr> "Labour", "Labour", "Conservative", "Conservative...
## $ constituency <chr> "Hackney North and Stoke Newington", "Oldham East...

Upvotes: 2
