GValER

Reputation: 47

Cannot access data on the web: URL HTTP status was '403 Forbidden'

This simple code...

url1 <- 'https://www.sec.gov/Archives/edgar/data/0001336528/0001172661-21-001865.txt'
data1 <- readLines(url1)

...leads to the following error message:

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://www.sec.gov/Archives/edgar/data/0001336528/0001172661-21-001865.txt': HTTP status was '403 Forbidden'

I tried many approaches and concluded that the site rejects my request whenever it is made from R (with this or any other code). Occasionally there was no error and the code worked fine, but usually it fails. I can always save the .txt file directly from the browser and then import it from my PC, but I cannot download it to my PC using R.

For example, I save the page as a .txt file and then read it:

data1 <- readLines("Persh01.txt")

Since the request sometimes succeeded, I also wrote a loop that kept retrying until it worked. That did the job, but after I switched to a different PC it no longer seems to work.

data1 <- NA
data1 <- try(readLines(url1))
while (inherits(data1, "try-error")) {
  data1 <- try(readLines(url1))
}

Would someone help me? Thanks

Upvotes: 1

Views: 3464

Answers (1)

Allan Cameron

Reputation: 174128

You need to pass a couple of headers to the server before it accepts your request. In this case, you need an appropriate User-Agent string and the header Connection = "keep-alive" to prevent the 403 error.

library(httr)

url1 <- 'https://www.sec.gov/Archives/edgar/data/0001336528/0001172661-21-001865.txt'
UA <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0"

res   <- GET(url1, add_headers(`Connection` = "keep-alive", `User-Agent` = UA))
data1 <- strsplit(content(res), "\n")[[1]]

head(data1, 10) 

#>  [1] "<SEC-DOCUMENT>0001172661-21-001865.txt : 20210816"   
#>  [2] "<SEC-HEADER>0001172661-21-001865.hdr.sgml : 20210816"
#>  [3] "<ACCEPTANCE-DATETIME>20210816163055"                 
#>  [4] "ACCESSION NUMBER:\t\t0001172661-21-001865"             
#>  [5] "CONFORMED SUBMISSION TYPE:\t13F-HR"                   
#>  [6] "PUBLIC DOCUMENT COUNT:\t\t2"                           
#>  [7] "CONFORMED PERIOD OF REPORT:\t20210630"                
#>  [8] "FILED AS OF DATE:\t\t20210816"                         
#>  [9] "DATE AS OF CHANGE:\t\t20210816"                        
#> [10] "EFFECTIVENESS DATE:\t\t20210816" 
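
If the request still fails intermittently, it may help to check the status code explicitly and cap the number of retries instead of looping forever. The following is a minimal sketch, not a definitive implementation: `fetch_lines` and `split_lines` are hypothetical helper names, `times = 3` is an arbitrary choice, and the calls are namespaced with `httr::` so the block does not depend on `library(httr)` having been run. It uses `httr::RETRY()`, which waits between attempts, and `httr::stop_for_status()`, which turns a 403 into an R error rather than returning the error page as data.

```r
# Split a response body into lines, accepting both LF and CRLF line endings.
split_lines <- function(txt) strsplit(txt, "\r?\n")[[1]]

# Hypothetical helper: download a text file and return it as a character vector.
fetch_lines <- function(url, ua, times = 3) {
  res <- httr::RETRY(
    "GET", url,
    httr::add_headers(`Connection` = "keep-alive", `User-Agent` = ua),
    times = times  # maximum number of attempts (assumed value)
  )
  # Raise an R error if the final attempt still returned 403, 404, etc.
  httr::stop_for_status(res)
  txt <- httr::content(res, as = "text", encoding = "UTF-8")
  split_lines(txt)
}

# Usage (needs network access), with url1 and UA defined as above:
# data1 <- fetch_lines(url1, UA)
```

Requesting the body with `as = "text"` keeps the result a single string regardless of the content type the server reports, so the split into lines behaves the same way each time.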

Note that the site's robots.txt file disallows web crawling and indexing from this part of the site, so you need to check that you are not violating the site's usage policy.
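
One rough way to see what a robots.txt file restricts is to list its Disallow entries. The sketch below assumes you have already read the file into a single string; `disallowed_paths` is a hypothetical helper, and it deliberately ignores which User-agent group each rule belongs to, so treat its output as a starting point for reading the file and the site's usage policy yourself.

```r
# Hypothetical helper: return the path of every "Disallow:" line in a
# robots.txt body, regardless of the User-agent group it appears under.
disallowed_paths <- function(robots_txt) {
  lines <- strsplit(robots_txt, "\r?\n")[[1]]
  hits  <- grep("^\\s*Disallow\\s*:", lines, ignore.case = TRUE, value = TRUE)
  # Drop everything up to and including the first colon, then trim spaces.
  trimws(sub("^[^:]*:", "", hits))
}

# Usage (needs network access and the httr package):
# res <- httr::GET("https://www.sec.gov/robots.txt",
#                  httr::add_headers(`User-Agent` = UA))
# disallowed_paths(httr::content(res, as = "text", encoding = "UTF-8"))
```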

Upvotes: 3
