Jen

Reputation: 11

rvest get html hyperlink in table

I'm trying to scrape the geocode from the hyperlinks on a page, and I want to build a data frame that contains all of the table data along with the geocode.

So far I have built the table using the following code:

library(rvest)

url<-"http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"

# read_html() supersedes html(), which is deprecated in newer rvest versions
citidata <- read_html(url)
ta <- citidata %>%
  html_nodes("table") %>%
  .[1:29] %>%
  html_table()

dat<-do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))

citystate <- citidata %>%
 html_node("h1 span") %>%
 html_text()

citystate <- gsub("Fatal car crashes and road traffic accidents in ",
                  "", citystate)

loc<-data.frame(matrix(unlist(strsplit(citystate, ",", fixed = TRUE)), ncol=2, byrow=TRUE))
dat$City<-loc$X1
dat$State<-loc$X2

This is the result:

Date,Location,Vehicles,Drunken.persons,Fatalites,Persons,Pedestrians,City,State
1 Jun 26, 2013 87:99 PM, Temple Street, 1, -, 1, 1, -, Nashua, New Hampshire

I then tried to add the geocode to the data frame, but I don't know how to do it.

Below is the code for scraping the geocode from the hyperlinks.

pg <- read_html("http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html")
geo <- data.frame(gsub("javascript:showGoogleSView","",pg %>% html_nodes("a") %>% html_attr("href") %>% .[31:60]))

Upvotes: 1

Views: 570

Answers (1)

hrbrmstr

Reputation: 78842

Not all the incidents have associated lat/lon pairs. The following code uses the fact that the incident date is (apparently) unique and merges the coordinates with the main dat that you built earlier:

library(rvest)
library(stringr)
library(dplyr)

url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"

# Get all incident tables -------------------------------------------------

# read_html() supersedes html(), which is deprecated in newer rvest versions
citidata <- read_html(url)

ta <- citidata %>%
  html_nodes("table") %>%
  .[1:29] %>%
  html_table()

# rbind them together -----------------------------------------------------

dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))

citystate <- citidata %>%
  html_node("h1 span") %>%
  html_text()

# Get city/state and add it to the data.frame -------------------------------

citystate <- gsub("Fatal car crashes and road traffic accidents in ", 
                  "", citystate)

loc <- data.frame(matrix(unlist(strsplit(citystate, ",", fixed=TRUE)), 
                         ncol=2, byrow=TRUE))

dat$City <- loc$X1
dat$State <- loc$X2

# Get GPS coords where available ------------------------------------------

coords <- citidata %>% 
  html_nodes(xpath="//a[@class='showStreetViewLink']") %>% 
  html_attr("href") %>% 
  str_extract("([[:digit:]-,\\.]+)") %>% 
  str_split(",") %>% 
  unlist() %>% 
  matrix(ncol=2, byrow=TRUE) %>% 
  data.frame(stringsAsFactors=FALSE) %>% 
  rename(lat=X1, lon=X2) %>% 
  mutate(lat=as.numeric(lat), lon=as.numeric(lon))

# Get GPS coordinates associated incident time for merge ------------------

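# Use citidata (the page parsed above); pg was never defined in this script
coord_time <- citidata %>% 
  html_nodes(xpath="//a[@class='showStreetViewLink']/../preceding-sibling::td[1]") %>%
  html_text() %>% 
  data.frame(Date=., stringsAsFactors=FALSE)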

# Merge the coordinates with the data.frame we built earlier --------------

# Incidents without a Street View link keep their row and get NA for lat/lon
left_join(dat, bind_cols(coords, coord_time), by="Date")
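
If you want to check the parse-and-merge logic without hitting the site, here is a minimal offline sketch. The href strings, dates, street names and the incidents data frame are all made up for illustration, and the href format is only an assumption based on the javascript:showGoogleSView prefix in the question. It shows the same idea as above: extract the "lat,lon" pair from each link, then left-join on Date so that incidents without a link keep their row with NA coordinates.

```r
library(dplyr)

# Hypothetical href values; the real links on the page may be formatted differently
hrefs <- c("javascript:showGoogleSView(42.7654,-71.4676)",
           "javascript:showGoogleSView(42.7480,-71.4910)")

# Pull the "lat,lon" text out of each href, then split it into two columns
pairs <- regmatches(hrefs, regexpr("-?[0-9.]+,-?[0-9.]+", hrefs))
parts <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE))

# Coordinates keyed by the (made-up) incident date they belong to
coords <- data.frame(
  Date = c("Jan 1, 2013 01:00 PM", "Mar 3, 2013 09:30 AM"),
  lat  = as.numeric(parts[, 1]),
  lon  = as.numeric(parts[, 2]),
  stringsAsFactors = FALSE
)

# Toy incident table: the middle row has no Street View link
incidents <- data.frame(
  Date     = c("Jan 1, 2013 01:00 PM", "Feb 2, 2013 11:15 PM", "Mar 3, 2013 09:30 AM"),
  Location = c("Main Street", "Elm Street", "Oak Street"),
  stringsAsFactors = FALSE
)

# left_join keeps every incident; rows without a match get NA for lat and lon
merged <- left_join(incidents, coords, by = "Date")
print(merged)
```

Note that the join only lines up correctly if Date really is unique per incident; if two incidents share a timestamp, you would need an additional key.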

Upvotes: 1
