CptnJustc

Reputation: 11

Using download.file gives old version of file

I download a file every month, but I can't get download.file to retrieve the latest version.

I've tried cacheOK=FALSE, but it makes no difference. Putting the same URL directly into a web browser gives the correct (latest) version of the file every time. Inspecting the downloaded file confirms that R is getting an older version, whereas the browsers are not.

To simplify, I'm downloading a zip file with a csv inside, and reading the csv before processing:

report <- "MWTS"
url <- paste0("http://www.census.gov/econ_getzippedfile/?programCode=", report)
download.file(url, paste0(report, "-mf.zip"), mode="wb", cacheOK=FALSE)
file.raw <- readLines(unz(paste0(report, "-mf.zip"), paste0(report, "-mf.csv")))

As of this posting, grep("^371", file.raw) returns an integer vector of length 1, but it should return hundreds of matches (371 is the timecode for November 2022, the latest data release).

Upvotes: 1

Views: 62

Answers (1)

r2evans

Reputation: 160447

It appears that the remote site responds differently depending on certain request headers, namely User-Agent and (which seems odd to me) Accept-Language (perhaps to reduce international queries?).

This returns the previous data:

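library(httr)
# Plain GET with httr's default headers: the server answers with the older file
res <- GET("https://www.census.gov/econ_getzippedfile/?programCode=MWTS")
# Ask for the body as raw bytes explicitly before writing the zip to disk
writeBin(content(res, as = "raw"), "MWTS-mf1.zip")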

whereas this returns the current data:

res <- GET("https://www.census.gov/econ_getzippedfile/?programCode=MWTS",
           user_agent("''"), add_headers(`Accept-Language`="''"))
writeBin(content(res), "MWTS-mf2.zip")
file.info(Sys.glob("MWTS-mf*.zip"))
#                size isdir mode               mtime               ctime               atime exe
# MWTS-mf1.zip 528852 FALSE  666 2023-01-10 13:03:18 2023-01-10 13:03:07 2023-01-10 13:03:19  no
# MWTS-mf2.zip 518246 FALSE  666 2023-01-10 13:03:19 2023-01-10 13:03:19 2023-01-10 13:03:19  no

Note the change in file size. Some interesting points:

  • Using user_agent("") still works, but
  • using add_headers(`Accept-Language`="") does not.

I infer from this that the server applies some kind of filter, perhaps requiring a non-empty Accept-Language and a User-Agent not known to be curl or similar.
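
If you would rather stay with download.file, the same idea can probably be expressed in base R. The following is only a sketch that I have not verified against this server: it assumes R >= 3.6.0 (which added the headers argument to download.file), the function name fetch_report is made up for illustration, and the header values simply copy the httr call above rather than being known minimum requirements.

```r
# Hypothetical helper: download the zipped report with explicit request headers.
# Assumes R >= 3.6.0 for the `headers` argument of download.file().
fetch_report <- function(report, destfile = paste0(report, "-mf.zip")) {
  url <- paste0("https://www.census.gov/econ_getzippedfile/?programCode=", report)

  # download.file() takes its User-Agent from the HTTPUserAgent option,
  # so set it temporarily and restore the previous value on exit.
  old <- options(HTTPUserAgent = "''")
  on.exit(options(old), add = TRUE)

  # Accept-Language is sent as an additional header; the value mirrors
  # the httr example and is an assumption, not a documented requirement.
  download.file(url, destfile, mode = "wb", cacheOK = FALSE,
                headers = c("Accept-Language" = "''"))

  invisible(destfile)
}
```

After calling fetch_report("MWTS"), the check from the question, length(grep("^371", file.raw)), should show whether the current release came back. Restoring the option with on.exit keeps the changed user agent from leaking into other downloads in the same session.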

Upvotes: 1
