Enrique

Reputation: 10127

skip and autostart in fread

I am using the following code to read a file with the data.table library:

fread(myfile, header=FALSE, sep=",", skip=100, colClasses=c("character","numeric","NULL","numeric"))

but I get the following error:

The supplied 'sep' was not found on line 80. To read the file as a single character column set sep='\n'.

It says it did not find sep on line 80; however, I set skip=100, so it should not pay attention to the first 100 lines.

UPDATE: I tried with skip=101 and it worked, but it skips the first line where the data starts.

I am using version 1.9.2 of the data.table package and R version 3.0.2 (64-bit) on Windows 7.

Upvotes: 5

Views: 17255

Answers (1)

Matt Dowle

Reputation: 59612

We don't know the version number you're using, but I can make a guess in this case.

Try setting autostart=101.

Note the first paragraph of Details in ?fread :

Once the separator is found on line autostart, the number of columns is determined. Then the file is searched backwards from autostart until a row is found that doesn't have that number of columns. Thus, the first data row is found and any human readable banners are automatically skipped. This feature can be particularly useful for loading a set of files which may not all have consistently sized banners. Setting skip>0 overrides this feature by setting autostart=skip+1 and turning off the search upwards step.

The skip argument has:

If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).

and the autostart argument has:

Any line number within the region of machine readable delimited text, by default 30. If the file is shorter or this line is empty (e.g. short files with trailing blank lines) then the last non empty line (with a non empty line above that) is used. This line and the lines above it are used to auto detect sep, sep2 and the number of fields. It's extremely unlikely that autostart should ever need to be changed, we hope.

In your case perhaps the human readable header is much larger than 30 rows, which is why I guess setting autostart=101 might work. No need to use skip.

One motivation is for convenience when a file contains multiple tables. By setting autostart to any row inside the table that you want to pluck out of the file, it'll find the first data row and header row for you automatically, and then read just that table. You don't have to worry about getting the exact line number at the start of data like you do with skip. fread can only read one table currently. It could feasibly return a list of tables from a single file, but that's getting a bit complicated and nobody has asked for that.
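For comparison, here is a minimal sketch of the two skip forms quoted from ?fread above (not the autostart route), tried on a throwaway file. The file contents, the five-line banner and the search string are made up for illustration; the real file in the question will differ.

```r
library(data.table)

# Hypothetical file: 5 banner lines without commas, then 3 data rows (no header row).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  paste("banner line", 1:5),
  "a,1.5,foo,10",
  "b,2.5,bar,20",
  "c,3.5,baz,30"
), tmp)

# Numeric skip: ignore the first 5 lines, so line 6 (skip + 1) is the first data row.
dt_num <- fread(tmp, sep = ",", header = FALSE, skip = 5)

# String skip: start on the first line that contains the given text.
dt_str <- fread(tmp, sep = ",", header = FALSE, skip = "a,1.5")
```

Both calls should return the same three rows. The numeric form needs the exact count of banner lines, which is the off-by-one risk described in the question; the string form only needs some text that first appears on the line where reading should start.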

Upvotes: 4
