Reputation: 6157
Some .csv files with numerical data that I work with contain errors. Each error is marked by a random string; for example, after reading in, the data frame could look like this:
set.seed(123)
rand.str <- paste0(letters[sample(10)], collapse="")
wrong.output <- data.frame(a=1:5, b=c(4:5, rand.str, 7:8), stringsAsFactors=FALSE)
In this case the proper output is:
proper.output <- data.frame(a=1:5, b=c(4:5, NA, 7:8))
After reading with read.csv, each column with at least one character value is treated as a character column.
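To make the problem reproducible without the file, here is a minimal sketch that reads the same kind of data from an inline string. The string contents and the names csv.text and df are assumptions made for illustration, not the real file.

```r
# Inline CSV standing in for the real file (contents are assumed for illustration)
csv.text <- "a,b\n1,4\n2,5\n3,chdgfajibe\n4,7\n5,8\n"

# Read it the same way as a file; stringsAsFactors = FALSE matches the example above
df <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# Column a is parsed as integer, but the single bad value turns all of b into character
str(df)
class(df$b)
```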
Can I mark errors (random strings) as NAs while reading in the file? If not, what is the most convenient, proper, or fastest method for substituting them with NAs?
There is the na.strings argument in read.csv, but it is a solution only in simpler cases, where the error markers are known in advance and it can be used like: na.strings=c("-", "unavailable")
(I can't see any duplicate, so I guess there is a simple workaround.)
The colClasses suggestion does not work:
read.csv("test.txt", sep=",", colClasses = c("numeric", "numeric"))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  scan() expected 'a real', got 'chdgfajibe'
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'test.txt'
Upvotes: 1
Views: 83
Reputation: 4648
I adapted this from an answer to a different CSV-reading question that is about 7 years old. I think it is a cleaner solution, and it gives your desired output.
setClass("Alpha")
# Strip alphabetic characters before converting; an empty string becomes NA
setAs("character", "Alpha",
      function(from) as.numeric(gsub('[[:alpha:]]+', '', from)))
read.csv('data.csv', colClasses = c('numeric', 'Alpha'))
Output:
  a  b
1 1  4
2 2  5
3 3 NA
4 4  7
5 5  8
Source: How to read data when some numbers contain commas as thousand separator?
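One caveat: stripping letters keeps any digits inside an error string, so a marker such as "chd9fajibe" would be read as 9. If that can happen in your data, a variant that converts the whole string and suppresses the coercion warning may be safer. This is only a sketch: the class name NumOrNA and the inline csv.text are made up for illustration.

```r
library(methods)  # setClass/setAs; attached by default in recent R, explicit here for Rscript

# Hypothetical class name, used only as a label for colClasses
setClass("NumOrNA")

# Convert the whole string; anything that is not a valid number becomes NA
setAs("character", "NumOrNA",
      function(from) suppressWarnings(as.numeric(from)))

# Inline data standing in for the file; the error string contains a digit on purpose
csv.text <- "a,b\n1,4\n2,5\n3,chd9fajibe\n4,7\n5,8\n"
df <- read.csv(text = csv.text, colClasses = c("numeric", "NumOrNA"))
df
```

Because the conversion runs inside read.csv through colClasses, column b comes back numeric with NA in the third row and needs no separate clean-up pass.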
Upvotes: 1