Reputation: 355
While reading a squid log in zipped format using read.table(), I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 134147 did not have 10 elements
Unzipping the file, I see that line 134147 is corrupted. However, other lines are similarly corrupted, so it is not practical to repeatedly run read.table, note the offending line number, delete that line, and start the whole process over again.
Is there a way by which I could tell R to ignore such lines and continue reading the rest of the table? I tried with try() but without any success.
I have read some related posts on ignoring read.table() errors, but all of them suggest correcting the offending line(s), which is not an option for me: the files are zipped, so I would have to unzip them manually, and there may be several corrupted lines.
My code for reading (with the try block):
try({dfApr4gw1 <- read.table(
"log1.gz", header=FALSE,
col.names = c("time", "duration", "local ip", "squid result code", "bytes", "request method", "url", "user", "squid hierarchy code", "type"),
na.strings="-",
colClasses = c("numeric", "integer", "factor", "factor", "integer", "factor", "character", "character", "character", "factor")
)})
Upvotes: 2
Views: 518
Reputation: 521093
One option which might suit your use case would be to read in each line as a single column, and then split each line on its delimiter, which in a squid log is a space. You can discard any rows which do not have the 10 columns you are expecting.
dfApr4gw1 <- read.table(
    "log1.gz", header=FALSE, sep="\n", quote="", comment.char="",
    col.names = c("column"), na.strings="-",
    colClasses = c("character"))
rows.keep <- apply(dfApr4gw1, 1, function(x) {
    length(strsplit(x, " ")[[1]]) == 10
})
dfApr4gw1 <- dfApr4gw1[rows.keep, ]
Now dfApr4gw1 only contains well-formed rows, and you can convert it to a data frame with 10 columns of the appropriate types easily.
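For example, that conversion step could look like this. This is a sketch using made-up sample log lines, assuming a single-space delimiter; the column names and types follow the ones in your original read.table call:

```r
# Toy stand-in for the filtered single-column data frame (two well-formed lines)
dfApr4gw1 <- data.frame(
    column = c(
        "1333497600.123 250 10.0.0.1 TCP_MISS/200 1024 GET http://example.com/ user1 DIRECT/93.184.216.34 text/html",
        "1333497601.456 120 10.0.0.2 TCP_HIT/200 512 GET http://example.org/ user2 NONE/- text/css"
    ),
    stringsAsFactors = FALSE
)

# Split each line on single spaces and rebind into a 10-column data frame
fields <- do.call(rbind, strsplit(dfApr4gw1$column, " ", fixed = TRUE))
df <- as.data.frame(fields, stringsAsFactors = FALSE)
names(df) <- c("time", "duration", "local.ip", "squid.result.code", "bytes",
               "request.method", "url", "user", "squid.hierarchy.code", "type")

# Restore the numeric/integer columns
df$time     <- as.numeric(df$time)
df$duration <- as.integer(df$duration)
df$bytes    <- as.integer(df$bytes)
```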
Upvotes: 1