Reputation: 1043
I have a 5 GB data file to load. fread seems to be a fast way to load it, but it gets all my column types wrong. It looks like the quotes are causing the problem.
# Code. I don't know how to put raw CSV data here.
library(data.table)
dt <- fread("data.csv", header = TRUE)
dt2 <- read.csv("data.csv", header = TRUE)
str(dt)
str(dt2)
This is the output. fread reads every column as chr, regardless of whether the column actually holds num or chr values.
Upvotes: 1
Views: 893
Reputation: 8187
It looks as if the fread command detects the type of each column and then assigns the lowest type it can to that column, based on what the column contains. From the fread documentation:
A sample of 1,000 rows is used to determine column types (100 rows from 10 points). The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a higher type in rows outside the sample. In that case, the column types are bumped mid read and the data read on previous rows is coerced.
This means that if a column contains mostly numeric values, fread might initially assign it as numeric, but if it finds any character values later on, it will coerce everything read up to that point to character.
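As a minimal sketch of this behaviour, assuming the data.table package is installed: the two inline CSV strings below are made up for illustration and are not the asker's data. A single non-numeric entry is enough for the whole column to come back as character.

```r
library(data.table)

# Hypothetical inline data: every entry in "value" is numeric.
clean <- fread("id,value\n1,2.5\n2,3.5\n3,4.5\n", header = TRUE)

# Same data, but one entry in "value" is not a number.
dirty <- fread("id,value\n1,2.5\n2,oops\n3,4.5\n", header = TRUE)

# Report the first class of each column.
clean_types <- vapply(clean, function(col) class(col)[1], character(1))
dirty_types <- vapply(dirty, function(col) class(col)[1], character(1))

print(clean_types)  # id is integer, value is numeric
print(dirty_types)  # id is integer, value is character
```

If every column of a file comes back as character, it may be worth checking each column for stray non-numeric entries in this way.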
You can read about these type conversions here, but the long and short of it seems to be that converting a character column to numeric turns any value that is not numeric into NA, and converting a double to an integer can lose precision.
You might be okay with this loss of precision, but fread will not let you do this downward conversion through colClasses. You might want to go in and remove the non-numeric values yourself.
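One way to find the offending values is to coerce the column after reading and look at what turned into NA. This is only a sketch, assuming data.table is installed; the inline data and the column name value_num are hypothetical.

```r
library(data.table)

# Hypothetical inline data with one non-numeric entry in "value".
dt <- fread("id,value\n1,2.5\n2,oops\n3,4.5\n", header = TRUE)

# Coerce manually; entries that are not numbers become NA
# (the coercion warning is suppressed on purpose).
dt[, value_num := suppressWarnings(as.numeric(value))]

# Rows where the text was present but could not be parsed as a number.
bad_rows <- dt[is.na(value_num) & !is.na(value)]
print(bad_rows)

# Going from double to integer drops the fractional part.
truncated <- as.integer(2.7)
print(truncated)
```

Once the rows in bad_rows are fixed or dropped, the coerced column can replace the original one.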
Upvotes: 0
Reputation: 4671
It's curious that fread didn't use numeric for the id column; maybe some entries contain non-numeric values?
The documentation suggests using the colClasses parameter.
dt <- fread("data.csv", header = TRUE, colClasses = c("numeric", "character"))
The documentation has a warning for using this parameter:
A character vector of classes (named or unnamed), as read.csv. Or a named list of vectors of column names or numbers, see examples. colClasses in fread is intended for rare overrides, not for routine use. fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.
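To illustrate the "promote to a higher type" case from the quote, here is a small sketch, assuming data.table is installed. The inline CSV and the column names are made up, and a named colClasses vector is used so that only the id column is overridden.

```r
library(data.table)

# Hypothetical inline data: "id" would normally be detected as integer.
csv_text <- "id,name\n1,alice\n2,bob\n"

default_dt <- fread(csv_text, header = TRUE)
default_class <- class(default_dt$id)

# Promote "id" to higher types via colClasses.
as_numeric_dt <- fread(csv_text, header = TRUE, colClasses = c(id = "numeric"))
as_char_dt <- fread(csv_text, header = TRUE, colClasses = c(id = "character"))

print(default_class)            # integer
print(class(as_numeric_dt$id))  # numeric
print(class(as_char_dt$id))     # character
```

Going the other way (character down to numeric) is the case the documentation says you have to handle yourself after reading.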
Upvotes: 3