adamdsmith
adamdsmith

Reputation: 930

Bypassing "ghost" line break or end of file (EOF) in data.table::fread

I'm loading several large, tab-delimited text files exported from an (accessible to me) database into R using data.table::fread. fread handles most of the files with great ease and speed, but one of the files is generating a regularly-reported fread error:

Error in fread(read_problem, encoding = "UTF-8", na.strings = "", header = TRUE,  : 
                   Expected sep ('  ') but new line or EOF ends field ...

A smaller (2000 rows) version of the file containing the offending line is available here (RDS file).

Here's how I've tried to diagnose the problem to this point:

library(data.table) # I'm using 1.9.7 development (same error with 1.9.6)
read_problem <- readRDS("read_problem.rds")
error <- fread(read_problem, encoding = "UTF-8", na.strings = "",
               header = TRUE, sep = "\t", 
               colClasses = rep("character", 44),  # For simplicity
               verbose = TRUE)

If I excise the offending line, the problem disappears:

cat(read_problem, file = "temp")   
string_vec <- readLines("temp")
clipped_vec <- string_vec[-1027] # Get rid of problem line 1027
restored <- paste(clipped_vec, collapse = "\n")
noerror <- fread(restored, encoding = "UTF-8", na.strings = "",
                 header = TRUE, sep = "\t", 
                 colClasses = rep("character", 44)) # For simplicity
class(noerror)
[1] "data.table" "data.frame"

dim(noerror) 
[1] 1999   44

The error message seems clear enough: fread is looking for a "\t" but is finding something else in its place.

But I find nothing obvious from a closer look at the offending line relative to those around it.

The number of tab characters is the same

sapply(gregexpr("\t", string_vec[1026:1028]), length)
[1] 43 43 43

Line break information seems identical

unlist(gregexpr("\n", string_vec[1026:1028]))
[1] -1 -1 -1

Here's a look at the offending line itself as a string:

string_vec[1027]
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS132960387\t29816\tspecies\tNelson's Sparrow\tAmmodramus nelsoni\t\t\t1\t\t\tUnited States\tUS\tGeorgia\tUS-GA\tGlynn\tUS-GA-127\tUS-GA_3181\t\t\tJekyll Island\tL140461\tH\t31.0464993\t-81.4113007\t1990-11-03\t13:15:00\t\"Jekyll Island and Causeway. Partly cloudy, mild, NE wind 8-15 mph. Note: Did very little birding in upland habitats as time available was rather brief.\"  Data entered on behalf of Paul Sykes by Alison Huff ([email protected]) on 12-15-11.\tListed on old Georgia Field Checklist as \"Sparrow, Sharp-tailed.\"\tobsr289931\tPaul\tSykes\tS9336358\teBird - Traveling Count\tEBIRD\t270\t8.047\t\t1\t1\t\t1\t0\t\t"

Any advice to get around this problem without the manual extraction of offending lines?

Upvotes: 7

Views: 4230

Answers (3)

Mahesh
Mahesh

Reputation: 29

For this error of "Expected sep ('|') but new line or EOF ends field 6 on line 8863 when reading data:"

you need to just add additional quote="" in the fread code

fread(load_file_from_directory, sep = "|",quote="")

Upvotes: 0

Udit Gupta
Udit Gupta

Reputation: 1

One possible solution is to:

  1. Read all CSVs into one list

    df<-lapply(csv, function(x) read.csv(x, stringsAsFactors = FALSE))

Each element of the list represents one CSV

  1. Convert list into one big dataframe

df2 <- ldply(df, data.frame)

  1. Remove row containing EOF using grep as usual.

df3<-df2[!grepl("eof", df2$V1),]

where V1 is the column name where EOF is.

Upvotes: 0

Arun
Arun

Reputation: 118849

With with this commit, this is now fixed in v1.9.7, current development version. The next stable release should therefore be able to read this properly using quote="".

require(data.table) #v1.9.7+
fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n', quote="")
#                    V1                V2 V3
# 1: "abcd efgh." ijkl. mnop "qrst uvwx." 45

On the 1027th line, at the end of "Sparrow, Sharp-tailed." there's only one tab. Where as in the other lines, after that field, there are two before the "obsr[0-9]" field starts.

The number of tabs seem to match because, on line 1027, there's a tab before "Listed on old Georgia Field" instead of a space..

Therefore line 1027 gets only 43 cols instead of 44. This seems to be the issue.


Looking at this again, it seems like Listed on old Georgia Field Checklist as "Sparrow, Sharp-tailed." should be read as a separate column but instead is being read together with the previous column...

Here's a smaller reproducible example:

# note that there are only 2 instead of 3 columns
fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
#                                     V1 V2
# 1: abcd efgh." ijkl.\tmnop "qrst uvwx. 45

# add a header column and it returns the same error
fread('a\tb\tc\n"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
# Error in fread("a\tb\tc\n\"abcd efgh.\" ijkl.\tmnop \"qrst uvwx.\"\t45\n") : 
#   Expected sep (' ') but new line, EOF (or other non printing character) 
#   ends field 1 when detecting types (   first): "abcd efgh." ijkl.    mnop 
#   "qrst uvwx."    45

Filed 1367.

Upvotes: 6

Related Questions