MLE

Reputation: 1043

fread reading data structure wrong with quotes

I have a 5 GB data file to load. fread seems to be a fast way to load it, but it reads all my column types wrong. It looks like the quotes are causing the problem.

# Code. I don't know how to include the raw CSV data here.
library(data.table)
dt <- fread("data.csv", header = TRUE)
dt2 <- read.csv("data.csv", header = TRUE)
str(dt)
str(dt2)

This is the output. fread reads every column as character, regardless of whether the column actually holds numbers or text.

[Two screenshots: output of str(dt) and str(dt2)]

Upvotes: 1

Views: 893

Answers (2)

Mihai Chelaru

Reputation: 8187

It looks as if the fread command will detect the type in a particular column and then assign the lowest type it can to that column based on what the column contains. From the fread documentation:

A sample of 1,000 rows is used to determine column types (100 rows from 10 points). The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a higher type in rows outside the sample. In that case, the column types are bumped mid read and the data read on previous rows is coerced.

This means that if you have a column with mostly numeric type values it might assign the column as numeric, but then if it finds any character type values later on it will coerce anything read up to that point to character type.
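As a rough illustration of the end result, here is a small sketch on made-up inline data (the column names and values are hypothetical, and it assumes a data.table version recent enough to have the text argument of fread). With only three rows every row falls inside the sample, so this shows the final column type rather than the mid-read bump itself:

```r
library(data.table)

# Hypothetical two-column data, made up for illustration only
clean <- fread(text = "id,name\n1,a\n2,b\n3,c\n")
mixed <- fread(text = "id,name\n1,a\n2,b\nx,c\n")

class(clean$id)  # every id parses as a whole number, so the column is integer
class(mixed$id)  # the single "x" makes the whole column character
```

The only difference between the two inputs is one non-numeric entry, which is enough to change the type of the entire column.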

You can read about these type conversions here. In short, converting a character column to numeric turns every value that is not numeric into NA, and converting a double to an integer drops the fractional part, so precision is lost.

You might be okay with this loss of precision, but fread will not allow you to do this conversion using colClasses. You might want to go in and remove non-numeric values yourself.
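One possible way to do that clean-up after reading, sketched on made-up data (the value column and the "unknown" entry are hypothetical, not taken from the question): first list the rows that would not survive the conversion, then coerce the column in place.

```r
library(data.table)

# Hypothetical data: one entry in `value` is not a number
dt <- fread(text = "id,value\n1,10.5\n2,unknown\n3,7\n")

# Rows whose value is present but cannot be parsed as a number
bad <- dt[!is.na(value) & is.na(suppressWarnings(as.numeric(value)))]
print(bad)

# Coerce in place; the unparseable entries become NA
dt[, value := suppressWarnings(as.numeric(value))]
str(dt)
```

Inspecting bad before coercing shows exactly which entries will turn into NA, so you can decide whether to fix them in the source file instead.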

Upvotes: 0

zacdav

Reputation: 4671

It's curious that fread didn't use numeric for the id column. Maybe some entries contain non-numeric values?

The documentation suggests the use of colClasses parameter.

# Unnamed colClasses are matched to columns by position
dt <- fread("data.csv", header = TRUE, colClasses = c("numeric", "character"))

The documentation has a warning for using this parameter:

A character vector of classes (named or unnamed), as read.csv. Or a named list of vectors of column names or numbers, see examples. colClasses in fread is intended for rare overrides, not for routine use. fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.
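To make the "promote only" behaviour concrete, here is a minimal sketch on made-up inline data (the csv string and column names are hypothetical). It only demonstrates promotion to a higher type; per the quoted documentation, a downgrade has to be done by coercing after the read.

```r
library(data.table)

# Hypothetical data in which id would be detected as integer
csv <- "id,name\n1,a\n2,b\n"

auto   <- fread(text = csv)                                           # type detection only
as_dbl <- fread(text = csv, colClasses = c("numeric", "character"))   # unnamed, by position
as_chr <- fread(text = csv, colClasses = c(id = "character"))         # named, by column

class(auto$id)    # integer
class(as_dbl$id)  # promoted to double ("numeric")
class(as_chr$id)  # promoted to character
```

The named form only touches the columns you list, which fits the documentation's advice to keep colClasses for rare overrides.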

Upvotes: 3
