Ash Reddy

Reputation: 1042

Why does it take more time for data.table::fread to read a file when filename is specified differently?

I'm reading a file into R with fread, specifying the filename in two different ways:

fread("file:///C:/Users/Desktop/ads.csv")  
fread("C:/Users/Desktop/ads.csv")       # Just omitted "file:///"  

I've observed the runtime to be very different:

microbenchmark(  
fread("file:///C:/Users/Desktop/ads.csv"),  
fread("C:/Users/Desktop/ads.csv")
)

Unit: microseconds
expr                                           min        lq      mean    median       uq       max neval cld
fread("file:///C:/Users/Desktop/ads.csv") 5755.975 6027.4735 6696.7807 6235.3365 6506.652 41257.476   100   b
fread("C:/Users/Desktop/ads.csv")          525.492  584.0215  673.7166  647.4745  727.703  1476.191   100   a

Why does the runtime vary so much? With read.csv(), by contrast, I saw no noticeable difference between the two variants.

Upvotes: 25

Views: 2038

Answers (1)

MichaelChirico

Reputation: 34703

Update:

The following has been added to ?fread:

When input begins with http://, https://, ftp://, ftps://, or file://, fread detects this and downloads the target to a temporary file (at tempfile()) before proceeding to read the file as usual. Secure URLS (ftps:// and https://) are downloaded with curl::curl_download; ftp:// and http:// paths are downloaded with download.file and method set to getOption("download.file.method"), defaulting to "auto"; and file:// is downloaded with download.file with method="internal". NB: this implies that for file://, even files found on the current machine will be "downloaded" (i.e., hard-copied) to a temporary file. See ?download.file for more details.


From the source of fread:

if (str6 == "ftp://" || str7 == "http://" || str7 == "file://") {
  method = if (str7 == "file://") "auto"
           else getOption("download.file.method", default = "auto")
  download.file(input, tmpFile, method = method, mode = "wb", quiet = !showProgress)
}

That is, your file is being "downloaded" to a temporary file, which amounts to copying the full contents of the file to a temporary location before fread reads it. file:// is not really intended for use on local files, but on files in a network that need to be downloaded locally before being read (IIUC; FWIW, this is what fread's testing regime uses to imitate file download while testing on CRAN, where external file download is impossible).
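If the file is already on the local machine, one way to avoid the copy is to strip the scheme and pass fread a plain path. Below is a minimal sketch, assuming a hypothetical helper named strip_file_scheme (it is not part of data.table) and URLs that contain no percent-encoded characters:

```r
# Hypothetical helper, not part of data.table: convert a file:// URL
# into a plain path so that fread() can read the file in place.
strip_file_scheme <- function(x) {
  # Drop a leading "file://"; any other input passes through unchanged.
  path <- sub("^file://", "", x)
  # Windows-style URLs (file:///C:/...) leave "/C:/..."; drop that leading slash.
  sub("^/([A-Za-z]:)", "\\1", path)
}

strip_file_scheme("file:///C:/Users/Desktop/ads.csv")
# [1] "C:/Users/Desktop/ads.csv"

# Usage (assumes data.table is installed and the file exists):
# data.table::fread(strip_file_scheme("file:///C:/Users/Desktop/ads.csv"))
```

A Unix-style URL such as file:///home/user/ads.csv comes out as /home/user/ads.csv, since only the scheme is removed.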

I also notice that your timings are very small (well under a millisecond for the plain path, a few milliseconds for file://), which could explain the discrepancy vs. read.csv: a fixed copying cost is only noticeable when the read itself is fast. As a made-up illustration, imagine read.csv takes 1 second to read the file, while fread takes .01 seconds, and file copying takes .05 seconds. Then read.csv will look about the same in both cases (1 vs. 1.05 seconds), while fread looks substantially slower for the file:// case (.01 vs. .06 seconds).
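The arithmetic in the paragraph above can be written out in a few lines of R. The numbers are the illustrative ones from that paragraph, not measurements:

```r
# Illustrative timings in seconds (made-up values, not benchmarks).
copy_cost <- 0.05
read_time <- c(read.csv = 1, fread = 0.01)

# The same fixed copy cost is added to both readers.
with_copy <- read_time + copy_cost

# Relative slowdown caused by the copy: small for the slow reader,
# large for the fast one.
slowdown <- with_copy / read_time
print(round(slowdown, 2))
# read.csv    fread
#     1.05     6.00
```

Read the same way, the medians in the question (6235.3365 vs. 647.4745 microseconds) differ by roughly 5.6 milliseconds, which is consistent with a fixed copy overhead sitting on top of a sub-millisecond read.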

Upvotes: 26
