Reputation: 9
I'm learning how to do web scraping in R and thought I'd try things out using a page with a built-in table. My ultimate goal is to have a dataframe with four variables (Name, Party, Constituency, Link to individual webpage).
library(rvest)
library(XML)
url <- "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0"
constituency <- read_html(url)
print(constituency)
constituency_red <- constituency %>% html_nodes('td') %>% html_text()
constituency_red <- paste0(url, constituency_red)
constituency_red <- unique(constituency_red)
constituency_red
The output I get after completing these steps looks like I'm going in the right direction. However, as you can see when scrolling to the right, it's still a bit of a mess. Any ideas on what I can do to clean this up?
[974] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n Poulter, Dr\r\n (Conservative)\r\n "
[975] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Central Suffolk and North Ipswich"
[976] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0\r\n Pound, Stephen\r\n (Labour)\r\n "
[977] "http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0Ealing North"
After this I tried a second approach. The following code appears to give me a clean list of all the hyperlinks, so I'm wondering if this might be a potential workaround?
constituency_links <- constituency %>% html_nodes("tr") %>% html_nodes('td') %>% html_nodes("a") %>% html_attr("href")
constituency_links <- paste0(url, constituency_links)
constituency_links <- unique(constituency_links)
constituency_links
My third and final try was to use the following code:
all_constituency <- lapply(constituency_links, function(x) read_html(x))
all_constituency
When I run this, things slow down A LOT, and then I start getting Error in open.connection(x, "rb") : HTTP error 400.
So I tried running it as a loop instead.
for(i in constituency_links){
all_constituency[[i]] <- read_html(i)
}
I get the same error messages with this approach. Any suggestions on how to pull and clean this information would be much appreciated.
Upvotes: 0
Views: 1435
Reputation: 13128
We can start by obtaining text strings with MP names, party, and constituency:
text <- constituency %>% html_nodes('table') %>% html_nodes('tr') %>% html_text()
head(text, 3)
# [1] "Surname, First name\r\n Constituency\r\n\t\r\n "
# [2] "A\r\n back to top\r\n "
# [3] "\r\n Abbott, Ms Diane\r\n (Labour)\r\n \r\n\t\tHackney North and Stoke Newington\r\n\t"
We can iterate through text, parse each element, and split each string into the fields we want (name, party, constituency):
dd <- lapply(text, function(x) {
out <- unlist(strsplit(x, "\r\n"))[c(2, 3, 5)] # Use "\r\n" to split the strings
as.vector(sapply(out, function(x) sub("(\\t)+|\\s+", "", x))) # Strip the leading whitespace and tabs
})
# [[1]]
# [1] "Constituency" "" NA
# [[2]]
# [1] "back to top" "" NA
# [[3]]
# [1] "Abbott, Ms Diane" "(Labour)"
# [3] "Hackney North and Stoke Newington"
Now, make dd into a dataframe, filtering out irrelevant rows (those where party is blank):
df <- data.frame(matrix(unlist(dd), ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
names(df) <- c("name", "party", "con")
df$party <- sub("\\((.*)\\)", "\\1", df$party) # Remove parentheses
df <- df[df$party != "", ] # Remove rows where party is blank
head(df, 3)
# name party con
# 3 Abbott, Ms Diane Labour Hackney North and Stoke Newington
# 4 Abrahams, Debbie Labour Oldham East and Saddleworth
# 5 Adams, Nigel Conservative Selby and Ainsty
We can now deal with the links. When we inspect the links, those that are relevant to the MPs have the word "biographies" in them, so we use that to filter the list:
links <- constituency %>% html_nodes("a") %>% html_attr("href")
links <- links[grepl("biographies", links)]
head(links, 3)
# [1] "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172"
# [2] "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212"
# [3] "http://www.parliament.uk/biographies/commons/nigel-adams/4057"
And complete our dataframe by adding the links:
df$links <- links
str(head(df, 3))
# 'data.frame': 3 obs. of 4 variables:
# $ name : chr "Abbott, Ms Diane" "Abrahams, Debbie" "Adams, Nigel"
# $ party: chr "Labour" "Labour" "Conservative"
# $ con : chr "Hackney North and Stoke Newington" "Oldham East and Saddleworth" "Selby and Ainsty"
# $ links: chr "http://www.parliament.uk/biographies/commons/ms-diane-abbott/172" "http://www.parliament.uk/biographies/commons/debbie-abrahams/4212" "http://www.parliament.uk/biographies/commons/nigel-adams/4057"
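If you then want to fetch each biography page (the questioner's third attempt), note that these hrefs are already absolute URLs, so there is no need to paste0() the listing URL onto them; doing so produces malformed addresses, which would explain the HTTP error 400. A minimal sketch of downloading them, with basic error handling and a polite pause between requests (the 0.5-second delay is an arbitrary choice):

```r
# Fetch each MP's page; tryCatch() keeps one bad link from aborting
# the whole run, and Sys.sleep() spaces out the requests.
pages <- lapply(df$links, function(u) {
  Sys.sleep(0.5)
  tryCatch(read_html(u), error = function(e) NULL)
})
failed <- df$links[sapply(pages, is.null)]  # links that could not be read
```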
Upvotes: 1
Reputation: 78832
It's pretty straightforward:
library(rvest)
library(stringi)
library(purrr)
library(dplyr)
pg <- read_html("http://www.parliament.uk/mps-lords-and-offices/mps/?sort=0")
td_1 <- html_nodes(pg, xpath=".//td[contains(@id,'ctl00_ctl00_FormContent_SiteSpecificPlaceholder_PageContent_rptMembers_ctl')]")
data_frame(mp_name=html_text(html_nodes(td_1, "a")),
href=html_attr(html_nodes(td_1, "a"), "href"),
party=map_chr(stri_match_all_regex(html_text(td_1), "\\((.*)\\)"), 2),
constituency=html_text(html_nodes(pg, xpath=".//tr/td[2]"))) -> df
glimpse(df)
## Observations: 649
## Variables: 4
## $ mp_name <chr> "Abbott, Ms Diane", "Abrahams, Debbie", "Adams, N...
## $ href <chr> "http://www.parliament.uk/biographies/commons/ms-...
## $ party <chr> "Labour", "Labour", "Conservative", "Conservative...
## $ constituency <chr> "Hackney North and Stoke Newington", "Oldham East...
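The party column here comes from a regex capture group: stri_match_all_regex() returns, for each input string, a matrix whose first column is the full match and whose second column is the first capture group, and map_chr(..., 2) pulls out that second element. On a sample cell it behaves like:

```r
library(stringi)
cell <- "Abbott, Ms Diane (Labour)"
m <- stri_match_all_regex(cell, "\\((.*)\\)")[[1]]
m[1, 1]  # full match, including parentheses
m[1, 2]  # captured group: the party name without parentheses
```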
Upvotes: 2