Nancy
Nancy

Reputation: 4089

"Select" argument in R's data.table::fread

I'm trying to read selected columns from a csv using fread(). I've found that I can use a vector of column numbers, but not with the column names. In regards to the "select" argument, the documentation just says "Vector of column names or numbers to keep, drop the rest." They also provide the example of

fread(data, select=c("A","D"))

Thus, why does my code throw a subscript out of bounds error? Here's the gist of my code, hopefully generalizable to other users:

test = data.frame(matrix(c(1:50),ncol = 5))
names(test) = c("A", "B", "C", "D", "E")
write.table(test, file = "/Users/me/Desktop/test.txt", sep = ",")
fread("/Users/me/Desktop/test.txt", sep = ",", header = TRUE, select = c("A","B"))

Giving

Error in ans[[1]] : subscript out of bounds

However this gives the first column as well as the row number as a column:

fread("/Users/me/Desktop/test.txt", sep = ",", header = TRUE, select = c(1,2))
    1  1
1:  2  2
2:  3  3
3:  4  4
4:  5  5
5:  6  6
6:  7  7
7:  8  8
8:  9  9
9: 10 10

...And read.table() is able to uneventfully read the whole data set:

read.table("/Users/me/Desktop/test.txt", sep = ",", header = TRUE)
    A  B  C  D  E
1   1 11 21 31 41
2   2 12 22 32 42
3   3 13 23 33 43
4   4 14 24 34 44
5   5 15 25 35 45
6   6 16 26 36 46
7   7 17 27 37 47
8   8 18 28 38 48
9   9 19 29 39 49
10 10 20 30 40 50

Something is obviously going on with the rownames and the header, but I'm not sure how to resolve it. I've tried with and without headers. The data set I'm using (not in this example) already has rownames, so re-writing it with rownames = FALSE isn't an option.

Upvotes: 3

Views: 4828

Answers (4)

Rich Scriven
Rich Scriven

Reputation: 99331

This answer assumes your original data was not produced via write.table(), that you were given a file and are attempting to read it via fread() (which is also stated in the question).


I believe you are having this problem because of the row names in the file. I have yet to come up with a direct way to apply fread() to the data, but I think this work-around will be safe and won't cost you much in terms of efficiency. Here are the steps ...

1) Read the first line of the file with scan() and add an extra "" element at the beginning. This is to offset the header row to account for the row names in the file.

nm <- c("", scan("test.txt", "", nlines = 1, sep = ","))

2) Define the columns you want and find them in nm. Instead of 1 and 4, the offset now gives us 2 and 5 and accounts for the row names.

sel <- nm %in% c("A", "D")

3) Read the file, starting at the second line (i.e. without the header), and use sel in the selection argument.

library(data.table)
dt <- fread("test.txt", skip = 1, select = which(sel))

4) Now that we've read the data that we want, we can reset the column names.

setnames(dt, nm[sel])[]
#      A  D
#  1:  1 31
#  2:  2 32
#  3:  3 33
#  4:  4 34
#  5:  5 35
#  6:  6 36
#  7:  7 37
#  8:  8 38
#  9:  9 39
# 10: 10 40

If the example you give is a good representation of the actual data, I don't see any reason why this wouldn't work. Hope it works for you.

Upvotes: 3

nicola
nicola

Reputation: 24480

This example shows why you always need to check carefully the format of the file you are producing. There are some differences between read.table and fread; here the issue comes from the row names and how they are written by write.table. As always, reading carefully the doc (?write.table) helps a lot.

write.table by default writes the row names. But here is how:

filename<-"somefilename.txt"
write.table(test, file = filename, sep = ",")
readLines(filename,2)
#[1] "\"A\",\"B\",\"C\",\"D\",\"E\"" 
#"\"1\",1,11,21,31,41"

I read the first two lines of the produced file. Reading them carefully, you can see that this is not a "standard" CSV. Why? Because the header has a 4 commas while the "data" lines 5. For a standard CSV, you should put a comma before the first column name. This is achieved by adding col.names=NA in write.table:

write.table(test, file = filename, sep = ",", col.names=NA)
#now works
fread(filename, sep = ",", header = TRUE, select = c("A","B"))

You can check and see that now a comma as the first character of the file appears. Alternatively, you can avoid to write the row names putting row.names=FALSE in write.table, but this is not always possible, since some times they are meaningful.

Upvotes: 1

wmoco_6725
wmoco_6725

Reputation: 3169

The problem is not in your fread, it's in your write.table. By default it writes the names of each row in the first column. Have a look at the file written.

Try this: (explicity NOT writing the row.names)

write.table(test, file = "/Users/me/Desktop/test.txt", 
            sep = ",", row.names=FALSE)

Then do your fread(). It will work.

Upvotes: -1

hrbrmstr
hrbrmstr

Reputation: 78792

library(data.table)
library(readr)

# save mtcars as CSV w/o row names or column names

write_csv(mtcars, "mtcars.csv", col_names=FALSE)

# read in the same file with fread but since we tell it
# to not use a header, we have to specify the column names
# the way fread will create them otherwise you get your error

fread("mtcars.csv", header=FALSE, select=c("V1", "V4"))

##      V1  V4
## 1: 21.0 110
## 2: 21.0 110
## 3: 22.8  93
## 4: 21.4 110
## 5: 18.7 175
## 6: 18.1 105

# try again, this time keeping column names in the 
# data file
write.csv(mtcars, file = "mtcars.csv", row.names=FALSE)

# now read it back in and select based on column names
# I picked different columns
head(fread("mtcars.csv", select=c("mpg", "qsec")))

##     mpg  qsec
## 1: 21.0 16.46
## 2: 21.0 17.02
## 3: 22.8 18.61
## 4: 21.4 19.44
## 5: 18.7 17.02
## 6: 18.1 20.2

Upvotes: -1

Related Questions