Andy

Reputation: 11

Read a large data set via fread in R but only need a subset (one variable that equals some values)

I am trying to read a large dataset (>30 GB) in R, but my laptop only has 16 GB of RAM. I only need a subset of this dataset: all observations whose ID (one variable in the dataset holds this ID) matches one of a set of values that come from another dataset. With enough RAM, the natural approach would be to read both data files and then merge them on the common ID.

Given the limited RAM, is it possible to pre-process the data file with a shell command that I can pass as the cmd argument of fread? Or does anyone have an alternative solution? Thanks in advance!

Upvotes: 1

Views: 328

Answers (1)

GKi

Reputation: 39717

You can preprocess your data as described in R Data Import/Export, using the GNU text utilities join and sort.

# Create example files
t1 <- tempfile() # File 1: id and data
write.table(data.frame(id=1:5, val=5:1), t1, row.names=FALSE, col.names=FALSE)
t2 <- tempfile() # File 2: the ids to keep from File 1
write.table(c(1,3,4), t2, row.names=FALSE, col.names=FALSE)

t3 <- tempfile() # sorted copy of File 1
t4 <- tempfile() # sorted, de-duplicated copy of File 2
# join expects both inputs sorted on the join field (field 1 by default),
# so sort each file first and then read the joined rows from the pipe
read.table(pipe(paste("sort -k 1b,1", t1, ">", t3, ";",
                      "sort -u -k 1b,1", t2, ">", t4, ";",
                      "join", t3, t4)))
#  V1 V2
#1  1  5
#2  3  3
#3  4  2
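
If sorting a 30 GB file is too slow, a single-pass filter with awk might be an alternative. This is only a sketch under assumptions: the files are whitespace-separated, the ID is in the first column of both files, the ID file is not empty, and the ID list itself fits in memory. The file names and sample rows below are made up for illustration.

```shell
# Hypothetical sample files standing in for the large data file and the ID list
data=$(mktemp)
ids=$(mktemp)
printf '1 5\n2 4\n3 3\n4 2\n5 1\n' > "$data"
printf '1\n3\n4\n' > "$ids"

# While reading the first file (NR==FNR), store each wanted ID as an array key.
# While reading the second file, print only rows whose first field is a stored ID.
filtered=$(awk 'NR==FNR { keep[$1]; next } $1 in keep' "$ids" "$data")
printf '%s\n' "$filtered"

rm -f "$data" "$ids"
```

Only the ID list is held in memory; the data file is streamed line by line. The same awk command, built as a string with the real file paths, should be usable as the cmd argument of fread.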

Upvotes: 3
