Reputation: 11
I'm trying to download the latest version of a file that I download every month, but I can't seem to get download.file to retrieve the latest version.
I've tried using cacheOK=FALSE, but it doesn't seem to have any effect. Putting the same URL directly into a web browser gives the correct (latest) version of the file every time. Inspecting the downloaded file confirms that R is getting an older version, whereas the browsers are not.
To simplify, I'm downloading a zip file with a csv inside, and reading the csv before processing:
report <- "MWTS"
url <- paste0("http://www.census.gov/econ_getzippedfile/?programCode=", report)
download.file(url, paste0(report, "-mf.zip"), mode="wb", cacheOK=FALSE)
file.raw <- readLines(unz(paste0(report, "-mf.zip"), paste0(report, "-mf.csv")))
As of this posting, grep("^371", file.raw) returns an integer vector of length 1, but it should return hundreds of matches (371 is the timecode for November 2022, the latest data release).
Upvotes: 1
Views: 62
Reputation: 160447
It appears that the remote site replies differently based on the presence of some request header fields, namely User-Agent and (seems odd to me) Accept-Language (perhaps to reduce international queries?).
While this returns the previous data,
library(httr)
res <- GET("https://www.census.gov/econ_getzippedfile/?programCode=MWTS")
writeBin(content(res), "MWTS-mf1.zip")
This returns the current data:
res <- GET("https://www.census.gov/econ_getzippedfile/?programCode=MWTS",
user_agent("''"), add_headers(`Accept-Language`="''"))
writeBin(content(res), "MWTS-mf2.zip")
file.info(Sys.glob("MWTS-mf*.zip"))
# size isdir mode mtime ctime atime exe
# MWTS-mf1.zip 528852 FALSE 666 2023-01-10 13:03:18 2023-01-10 13:03:07 2023-01-10 13:03:19 no
# MWTS-mf2.zip 518246 FALSE 666 2023-01-10 13:03:19 2023-01-10 13:03:19 2023-01-10 13:03:19 no
Note the change in file size. Some interesting points:
user_agent("") still worked, but add_headers(`Accept-Language`="") did not work.
I infer from this that they have some filters, perhaps "non-empty accept-language, and a user-agent not known to be curl or similar".
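To tie this back to the workflow in the question, here is a minimal sketch that downloads the zip with both headers set, reads the csv inside it, and counts the lines for a given timecode. The helper names fetch_report() and count_timecode() are hypothetical, the header values are simply the ones that happened to work above, and the csv name follows the "<report>-mf.csv" pattern from the question; none of this is documented behaviour of the site.

```r
# Sketch only: fetch_report() and count_timecode() are hypothetical helpers.
# Requires the httr package; calls are namespaced so nothing is attached.

fetch_report <- function(report,
                         zipfile = paste0(report, "-mf.zip"),
                         ua = "''",      # User-Agent value used in the answer (assumption)
                         lang = "''") {  # Accept-Language value used in the answer (assumption)
  url <- paste0("https://www.census.gov/econ_getzippedfile/?programCode=", report)
  res <- httr::GET(url,
                   httr::user_agent(ua),
                   httr::add_headers(`Accept-Language` = lang))
  httr::stop_for_status(res)  # stop with an error on an HTTP failure status
  writeBin(httr::content(res, as = "raw"), zipfile)

  # Read the csv stored inside the zip, then release the connection
  con <- unz(zipfile, paste0(report, "-mf.csv"))
  on.exit(close(con))
  readLines(con)
}

# Count the lines that start with a given timecode (e.g. 371)
count_timecode <- function(lines, timecode) {
  sum(grepl(paste0("^", timecode), lines))
}
```

With these, file.raw <- fetch_report("MWTS") followed by count_timecode(file.raw, 371) repeats the check from the question. Whether the server keeps honouring these header values is not guaranteed, so it may be worth comparing the count against what a browser download gives.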
Upvotes: 1