Reputation: 11
I'm trying to scrape the geocode from the hyperlinks on the page, and I want to build a single table that combines all of the page's tables with the geocode.
So far I have built the table with the following code:
library(rvest)
url<-"http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"
citidata<- read_html(url)
ta<- citidata %>%
html_nodes("table") %>%
.[1:29] %>%
html_table()
dat<-do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))
citystate <- citidata %>%
html_node("h1 span") %>%
html_text()
citystate <- gsub("Fatal car crashes and road traffic accidents in ",
"", citystate)
loc<-data.frame(matrix(unlist(strsplit(citystate, ",", fixed = TRUE)), ncol=2, byrow=TRUE))
dat$City<-loc$X1
dat$State<-loc$X2
This gives me:
Date,Location,Vehicles,Drunken.persons,Fatalites,Persons,Pedestrians,City,State
1 Jun 26, 2013 87:99 PM, Temple Street, 1, -, 1, 1, -, Nashua, New Hampshire
Then I tried to add the geocode to the data frame, but I don't know how to do it.
Below is the code for scraping the geocode from the hyperlinks:
pg <- read_html("http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html")
geo <- data.frame(gsub("javascript:showGoogleSView","",pg %>% html_nodes("a") %>% html_attr("href") %>% .[31:60]))
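Stripping only the function name leaves the parentheses and keeps both numbers in one string. As a rough sketch of splitting such links into numeric columns, the base R snippet below assumes each href looks like `javascript:showGoogleSView(lat,lon)`; the sample hrefs are made up for illustration and the real page's format may differ.

```r
# Hypothetical href values; the real page's format is an assumption.
hrefs <- c(
  "javascript:showGoogleSView(42.7575,-71.4644)",
  "javascript:showGoogleSView(42.7654,-71.4676)"
)

# Drop everything except digits, dots, commas and minus signs.
pairs <- gsub("[^0-9.,-]", "", hrefs)

# Split each "lat,lon" string and stack the pieces into a two-column matrix.
parts <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE))

# Convert the character pieces to numbers.
geo <- data.frame(
  lat = as.numeric(parts[, 1]),
  lon = as.numeric(parts[, 2])
)
print(geo)
```

This avoids relying on fixed positions such as `.[31:60]`, but it still needs the links to be selected reliably first.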
Upvotes: 1
Views: 570
Reputation: 78842
Not all of the incidents have associated lat/lon pairs. The following code relies on the incident date being (apparently) unique and merges the coordinates into the main dat data frame that you built earlier:
library(rvest)
library(stringr)
library(dplyr)
url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"
# Get all incident tables -------------------------------------------------
citidata <- read_html(url)  # html() is deprecated in newer rvest releases
ta <- citidata %>%
html_nodes("table") %>%
.[1:29] %>%
html_table()
# rbind them together -----------------------------------------------------
dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))
citystate <- citidata %>%
html_node("h1 span") %>%
html_text()
# Get city/state and add it to the data.frame -------------------------------
citystate <- gsub("Fatal car crashes and road traffic accidents in ",
"", citystate)
loc <- data.frame(matrix(unlist(strsplit(citystate, ",", fixed=TRUE)),
ncol=2, byrow=TRUE))
dat$City <- loc$X1
dat$State <- loc$X2
# Get GPS coords where available ------------------------------------------
coords <- citidata %>%
html_nodes(xpath="//a[@class='showStreetViewLink']") %>%
html_attr("href") %>%
str_extract("([[:digit:]-,\\.]+)") %>%
str_split(",") %>%
unlist() %>%
matrix(ncol=2, byrow=TRUE) %>%
data.frame(stringsAsFactors=FALSE) %>%
rename(lat=X1, lon=X2) %>%
mutate(lat=as.numeric(lat), lon=as.numeric(lon))
# Get the incident time for each GPS coordinate pair, for the merge -------
coord_time <- citidata %>%
html_nodes(xpath="//a[@class='showStreetViewLink']/../preceding-sibling::td[1]") %>%
html_text() %>%
tibble(Date=.)  # data_frame() is deprecated in favor of tibble()
# Merge the coordinates with the data.frame we built earlier --------------
left_join(dat, bind_cols(coords, coord_time), by="Date")
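To illustrate why a left join suits incidents that lack coordinates, here is a small sketch on made-up data. The names `dat_demo` and `coords_demo` and every value in them are hypothetical stand-ins, not scraped results. Rows with no match in the coordinate table are kept, and their `lat` and `lon` are filled with `NA`.

```r
library(dplyr)

# Made-up incidents; assume the second one has no Street View link.
dat_demo <- data.frame(
  Date = c("Jan 1, 2013 01:00 PM", "Feb 2, 2013 02:00 PM", "Mar 3, 2013 03:00 PM"),
  Location = c("Main Street", "Elm Street", "Temple Street"),
  stringsAsFactors = FALSE
)

# Made-up coordinates, keyed by the same Date strings.
coords_demo <- data.frame(
  Date = c("Jan 1, 2013 01:00 PM", "Mar 3, 2013 03:00 PM"),
  lat = c(42.76, 42.75),
  lon = c(-71.46, -71.47),
  stringsAsFactors = FALSE
)

# Keep every incident; unmatched rows get NA for lat and lon.
merged <- left_join(dat_demo, coords_demo, by = "Date")
print(merged)
```

The join only matches when the Date strings are identical in both tables, so stray whitespace in the scraped text would need trimming first. If dates ever repeat, the matching rows are duplicated.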
Upvotes: 1