Feyzi Bagirov

Reputation: 1372

Web scraping an IIS-based website

I am using R to scrape a table from this site.

I am using the rvest library.

#install.packages("rvest", dependencies = TRUE) 
library(rvest) 
OPMpage <- read_html("https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/")

I receive this error:

Error in open.connection(x, "rb") : HTTP error 403.

What am I doing wrong?

Upvotes: 6

Views: 13175

Answers (2)

alistaire

Reputation: 43354

The server is forbidding you from accessing the page because your request has no user-agent string in its headers. (Normally it's a string identifying the browser you're using, though some browsers let users spoof other browsers.) Using the httr package, you can set a user-agent string:

library(httr)
library(rvest)

url <- "https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/"

x <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))

Passed into a GET request, add_headers lets you set whatever headers you like. You could also use the more specific user_agent function in place of add_headers, if a user agent is all you want to set.
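
For example, the equivalent call with user_agent (a small sketch reusing the url object from above):

# user_agent() sets only the user-agent header, nothing else
x <- GET(url, user_agent('Gov employment data scraper ([[your email]])'))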

In this case any user-agent string will work, but it's polite (see the link at the end) to say who you are and what you want.

Now you can use rvest to parse the HTML and pull out the table. You'll need a way to select the relevant table; looking at the HTML, I saw it had class = "DataTable", but you can also use the SelectorGadget (see the rvest vignettes) to find a valid CSS or XPath selector. Thus

x %>% 
    read_html() %>% 
    html_node('.DataTable') %>% 
    html_table()

gives you a nice (if not totally clean) data.frame.
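
If you want the counts as numbers, here's one possible cleanup sketch; it assumes the figures come through as character strings with comma separators, which may not hold for every column:

tbl <- x %>% 
    read_html() %>% 
    html_node('.DataTable') %>% 
    html_table()

# strip commas and convert any column that parses cleanly as numeric,
# leaving the rest (e.g. columns with footnote markers) untouched
tbl[] <- lapply(tbl, function(col) {
    nums <- suppressWarnings(as.numeric(gsub(',', '', col)))
    if (any(is.na(nums))) col else nums
})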

Note: Scrape responsibly and legally. Given that OPM is a government source, it's in the public domain, but that's not the case with a lot of the web. Always read any terms of service, plus this nice post on how to scrape responsibly.

Upvotes: 13

Hack-R

Reputation: 23210

Your syntax for read_html (or the older html) is correct:

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie <- html("http://www.imdb.com/title/tt1490017/")  # html() is deprecated in newer rvest in favor of read_html()

But you're getting a 403 because either the page or the part of the page you're trying to scrape doesn't allow scraping.
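
You can confirm what the server is returning with httr (a quick diagnostic, not a fix):

library(httr)
resp <- GET("https://www.opm.gov/policy-data-oversight/data-analysis-documentation/federal-employment-reports/historical-tables/total-government-employment-since-1962/")
status_code(resp)  # a 403 here means the server itself is rejecting the request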

You may need to see vignette("selectorgadget") and use SelectorGadget in conjunction with rvest:

http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/

But, more likely, it's just not a page that's meant to be scraped. However, I believe Barack Obama and the new United States Chief Data Scientist, DJ Patil, recently rolled out a central hub to obtain that type of U.S. government data for easy import.

Upvotes: 0
