Reputation: 841
My CSV file (accessible through the link and viewable in the screenshot) has 8 observations. Observation 5 has a non-standard character in the "author" column, which I've shaded yellow.
https://docs.google.com/spreadsheets/d/1-douIz03OQqahG6WCWY-irOE52oXtDDc4fJ6myMwJDk/edit?usp=sharing
When I run the following:
data1<-read.csv("Book1.csv",colClasses=c("end_date_n"="character","start_date_n"="character"),stringsAsFactors=FALSE)
I get this warning message and only the first 4 rows and a partial 5th row are imported. The import stops at the point where the non-standard character appears in col 5.
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string
When I delete the "author" column from my csv source file, the import works fine.
How can I import the full file without having to delete the problem column?
Upvotes: 0
Views: 287
Reputation: 841
A colleague came up with this solution:
"The original character is ^z, which for decades was used by DOS/Windows as an end of file marker. Because UNIX systems never used ^z, the read-in problem is Windows-specific. Windows systems often direct users to enter non-ASCII characters (like é) using “ALT” codes. This may be where the ^z originates."
"Use a utility to translate ^z to something innocuous. The killZ function below takes the name of a file, translates ^z to *, then write the results in the same directory as the original file but with a -noz inserted just before the .txt or .csv (or whatever) filetype. You can then read the -noz file in the same way you have been reading the original .txt or .csv file."
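killZ <- function(fname) {
  # Open in binary mode so Windows does not treat ^Z as end of file
  f <- file(fname, "rb")
  on.exit(close(f))  # close the connection even if a later step fails
  res <- readLines(f, warn = FALSE)
  # Translate ^Z (octal 032) to *
  res <- gsub("\032", "*", res, fixed = TRUE)
  # Build the new file name, e.g. Book1.csv -> Book1-noz.csv
  # (tools ships with base R, so stringr is not needed and the
  # extension is no longer reused as a regular expression)
  ext <- tools::file_ext(fname)
  new_name <- paste0(tools::file_path_sans_ext(fname), "-noz",
                     if (nzchar(ext)) paste0(".", ext))
  writeLines(res, con = new_name)
  new_name
}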
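If you would rather not write a second file, a variation on the same idea is to do the replacement in memory and hand the cleaned text straight to read.csv. This is a sketch, not my colleague's function: `read_csv_no_ctrl_z` is a made-up name, it assumes the file fits in memory and contains no embedded NUL bytes, and the temporary file below stands in for Book1.csv.

```r
# Hypothetical wrapper: replace ^Z (0x1A) with * in memory, then parse as CSV.
read_csv_no_ctrl_z <- function(fname, ...) {
  bytes <- readBin(fname, what = "raw", n = file.size(fname))
  bytes[bytes == as.raw(0x1A)] <- charToRaw("*")
  # Extra arguments (colClasses, stringsAsFactors, ...) are passed to read.csv
  read.csv(text = rawToChar(bytes), ...)
}

# Self-contained demo: the first data row has a ^Z inside the quoted author field
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("id,author\n1,\"Jos"), as.raw(0x1A),
    charToRaw("\"\n2,\"Smith\"\n")),
  tmp
)
data1 <- read_csv_no_ctrl_z(tmp, stringsAsFactors = FALSE)
```

With the killZ approach, the equivalent call would be to pass the returned -noz file name to read.csv along with the same colClasses and stringsAsFactors arguments used in the question.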
Upvotes: 0