Danilo Correa

Reputation: 173

Error reading a zip file with fread function from data.table package

Error reading a zipped ".txt" file from a https web site with fread() function

Hi everyone,

I'm trying to read a zipped ".txt" file from an https web site with the fread() function, but I'm getting an error.

I also tried to read the zip file after downloading it, but I got the same error. Any ideas how to solve it?

fileUrl <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"

dt <- fread(fileUrl)

Error in fread(fileUrl) : 
  Internal error: invalid head position. jump=1, headPos=0000020B75510005, thisJumpStart=0000020B7560C040, sof=0000020B75510000

### tried reading it locally after download too:

dt <- fread("Dataset.zip")

But I got the same error message.

### unzipped, the file is read without error:

dt <- fread("household_power_consumption.txt")

str(dt)

Classes ‘data.table’ and 'data.frame':  2075259 obs. of  9 variables:
 $ Date                 : chr  "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
 $ Time                 : chr  "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
 $ Global_active_power  : chr  "4.216" "5.360" "5.374" "5.388" ...
 $ Global_reactive_power: chr  "0.418" "0.436" "0.498" "0.502" ...
 $ Voltage              : chr  "234.840" "233.630" "233.290" "233.740" ...
 $ Global_intensity     : chr  "18.400" "23.000" "23.000" "23.000" ...
 $ Sub_metering_1       : chr  "0.000" "0.000" "0.000" "0.000" ...
 $ Sub_metering_2       : chr  "1.000" "1.000" "2.000" "1.000" ...
 $ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...
 - attr(*, ".internal.selfref")=<externalptr>

Upvotes: 1

Views: 1402

Answers (2)

Jose

Reputation: 421

Just a brief update: you can use shell commands in fread to extract the files, like this:

url = "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"

download.file(url, destfile = "./household_power_c.zip", mode = "wb")
dt <- data.table::fread(cmd = "unzip -cq ./household_power_c.zip")

Output:


> str(dt)
Classes ‘data.table’ and 'data.frame':  2075259 obs. of  9 variables:
 $ Date                 : chr  "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
 $ Time                 : chr  "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
 $ Global_active_power  : chr  "4.216" "5.360" "5.374" "5.388" ...
 $ Global_reactive_power: chr  "0.418" "0.436" "0.498" "0.502" ...
 $ Voltage              : chr  "234.840" "233.630" "233.290" "233.740" ...
 $ Global_intensity     : chr  "18.400" "23.000" "23.000" "23.000" ...
 $ Sub_metering_1       : chr  "0.000" "0.000" "0.000" "0.000" ...
 $ Sub_metering_2       : chr  "1.000" "1.000" "2.000" "1.000" ...
 $ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Using shell commands is quite handy. You can explore all the options of the unzip command (see $ man unzip), for instance to extract just one file from an archive that contains several:

url <- "http://www.bls.gov/cex/pumd/data/comma/diary14.zip"
download.file(url, destfile = "dataset.zip", mode = "wb")
# shell command that extracts one file out of the many inside the zip archive
shc = 'unzip -cq dataset.zip diary14/expd141.csv'

zd <- data.table::fread(cmd = shc)

See this link for more information about using command-line tools in fread:

https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread#1-using-command-line-tools-directly
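
If the archive path or the member name contains spaces, pasting the command together by hand can break. Here is a minimal sketch of a helper that quotes both parts; unzip_cmd is a hypothetical name of mine, not something from data.table, and the usage part assumes that an unzip executable is on the PATH and that dataset.zip was downloaded as above:

```r
# Hypothetical helper: build the shell command passed to fread(cmd = ...).
# zip_path: path to a local zip file; member: optional file inside the archive.
unzip_cmd <- function(zip_path, member = NULL) {
  parts <- c("unzip -cq",
             shQuote(zip_path),
             if (!is.null(member)) shQuote(member))
  paste(parts, collapse = " ")
}

# Only run the shell route when unzip and the archive are actually available.
if (nzchar(Sys.which("unzip")) && file.exists("dataset.zip")) {
  zd <- data.table::fread(cmd = unzip_cmd("dataset.zip", "diary14/expd141.csv"))
}
```

The Sys.which() check matters because this route depends on an external program, so it is not guaranteed to work on every machine.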

Upvotes: 0

MichaelChirico

Reputation: 34773

fread does not automatically read .zip files, but you can unzip them cross-platform from within R:

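tmp_dir = tempdir()
tmp = tempfile(tmpdir = tmp_dir)
# mode = "wb" keeps the binary zip intact on Windows
download.file(fileUrl, tmp, mode = "wb")
# list the archive contents (a single file here), then extract into tmp_dir
outf = unzip(tmp, list = TRUE)$Name
unzip(tmp, outf, exdir = tmp_dir)
fread(file.path(tmp_dir, outf))[1:10]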
          Date     Time Global_active_power Global_reactive_power Voltage
 1: 16/12/2006 17:24:00               4.216                 0.418 234.840
 2: 16/12/2006 17:25:00               5.360                 0.436 233.630
 3: 16/12/2006 17:26:00               5.374                 0.498 233.290
 4: 16/12/2006 17:27:00               5.388                 0.502 233.740
 5: 16/12/2006 17:28:00               3.666                 0.528 235.680
 6: 16/12/2006 17:29:00               3.520                 0.522 235.020
 7: 16/12/2006 17:30:00               3.702                 0.520 235.090
 8: 16/12/2006 17:31:00               3.700                 0.520 235.220
 9: 16/12/2006 17:32:00               3.668                 0.510 233.990
10: 16/12/2006 17:33:00               3.662                 0.510 233.860
    Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
 1:           18.400          0.000          1.000             17
 2:           23.000          0.000          1.000             16
 3:           23.000          0.000          2.000             17
 4:           23.000          0.000          1.000             17
 5:           15.800          0.000          1.000             17
 6:           15.000          0.000          2.000             17
 7:           15.800          0.000          1.000             17
 8:           15.800          0.000          1.000             17
 9:           15.800          0.000          1.000             17
10:           15.800          0.000          2.000             16
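
The snippet above works because this archive holds exactly one file. For an archive with several members, unzip(tmp, list = TRUE)$Name returns several names and fread() would be handed more than one path. A rough sketch of how the same steps could be wrapped up, assuming you want the single member that matches a pattern; pick_member and fread_zip are hypothetical names, not part of data.table:

```r
# Hypothetical helper: choose exactly one member from the listing returned by
# unzip(zipfile, list = TRUE), which is a data.frame with a Name column.
pick_member <- function(listing, pattern = "\\.(txt|csv)$") {
  hits <- grep(pattern, listing$Name, value = TRUE, ignore.case = TRUE)
  if (length(hits) != 1L) {
    stop("expected exactly one matching file, found ", length(hits))
  }
  hits
}

# Hypothetical wrapper: extract that member to a scratch directory and read it.
fread_zip <- function(zip_path, pattern = "\\.(txt|csv)$", ...) {
  listing <- utils::unzip(zip_path, list = TRUE)
  member  <- pick_member(listing, pattern)
  out_dir <- tempfile("unzipped_")
  dir.create(out_dir)
  # remove the extracted copy once the data is in memory
  on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)
  utils::unzip(zip_path, files = member, exdir = out_dir)
  data.table::fread(file.path(out_dir, member), ...)
}
```

With the file from the question downloaded to tmp, this would be called as fread_zip(tmp); extra arguments are passed through to fread().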

Upvotes: 3
