Reputation: 3158
I'm attempting to scrape the Washington Post's database on police shootings. Since it's not plain HTML, I can't use rvest, so instead I used RSelenium and PhantomJS.
library(RSelenium)
checkForServer()
startServer()
eCap <- list(phantomjs.binary.path = "C:/Program Files/Chrome Driver/phantomjs.exe")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.washingtonpost.com/graphics/national/police-shootings/")
Upon inspecting the source, it's apparent that the items I'm interested in have the following id and class:
<div id="js-list-690" class="listWrapper cf">
I can access the text of the individual item:
remDr$findElement("css", "#js-list-691")$getElementText()
returns
[[1]]
[1] "An unidentified person, a 47-year-old Hispanic man, was shocked with a stun gun and shot on July 30, 2015, in Whittier, Calif. Los Angeles County deputies were investigating a domestic disturbance when he threatened the officers and struck one of them with a metal rod.\nMALEDEADLY WEAPONHISPANIC45 TO 54\nCBS Los AngelesWhittier Daily News"
But if I want to get a list of all these items:
remDr$findElements("class name", "listWrapper cf")
results in an error.
How do I get all the elements with the listWrapper cf class?
Upvotes: 4
Views: 2602
Reputation: 78792
It'd be way easier to just use the JSON data directly (use the "Developer Tools" in almost any modern browser to track the URLs the page loads; this one didn't take long to find in that list):
library(jsonlite)
url <- "https://js.washingtonpost.com/graphics/policeshootings/policeshootings.json?d14385542"
shootings <- fromJSON(url)
dplyr::glimpse(shootings)
## Observations: 564
## Variables:
## $ id (int) 3, 4, 5, 8, 9, 11, 13, 15, 16, 17, 19, 21, ...
## $ date (chr) "2015-01-02", "2015-01-02", "2015-01-03", "...
## $ description (chr) "Elliot, who was on medication for depressi...
## $ blurb (chr) "a 53-year-old man of Asian heritage armed ...
## $ name (chr) "Tim Elliot", "Lewis Lee Lembke", "John Pau...
## $ age (int) 53, 47, 23, 32, 39, 18, 22, 35, 34, 47, 25,...
## $ gender (chr) "M", "M", "M", "M", "M", "M", "M", "M", "F"...
## $ race (chr) "A", "W", "H", "W", "H", "W", "H", "W", "W"...
## $ armed (chr) "gun", "gun", "unarmed", "toy weapon", "nai...
## $ city (chr) "Shelton", "Aloha", "Wichita", "San Francis...
## $ state (chr) "WA", "OR", "KS", "CA", "CO", "OK", "AZ", "...
## $ address (chr) "600 block of E. Island Lake Drive", "4519 ...
## $ lat (dbl) 47.24683, 45.48620, 37.69477, 37.76291, 40....
## $ lon (dbl) -123.12159, -122.89128, -97.28055, -122.422...
## $ is_geocoding_exact (lgl) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
## $ mental (lgl) TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FAL...
## $ sources (list) http://kbkw.com/local-news/329755, http://...
## $ photos (list) NULL, NULL, 107, , , , //img.washingtonpos...
## $ videos (list) NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
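If you'd rather stay with RSelenium, the error in the question is most likely because the "class name" locator expects a single class, while "listWrapper cf" is two classes separated by a space. A CSS selector that chains both classes should match the same elements. The sketch below is untested against the live page; get_list_items is a hypothetical helper name, and it assumes remDr is the already opened driver from the question:

```r
# Hypothetical helper (not part of RSelenium): collect the text of every
# element that carries both the `listWrapper` and `cf` classes.
# Assumes `remDr` is an open remoteDriver, as in the question.
get_list_items <- function(remDr, selector = ".listWrapper.cf") {
  # "css selector" accepts compound selectors; "class name" does not
  elems <- remDr$findElements(using = "css selector", value = selector)
  # getElementText() returns a list, so take its first element
  vapply(elems, function(el) el$getElementText()[[1]], character(1))
}
```

Calling get_list_items(remDr) after remDr$navigate(...) should give a character vector with one entry per item, assuming the page has finished rendering by then.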
Upvotes: 6