Reputation: 11
I am trying to read a large dataset (>30 GB) into R, but my laptop has only 16 GB of RAM. However, I only need a subset of this dataset: specifically, all observations whose ID (one variable in my dataset represents this ID) equals one of a set of values that come from another dataset. If I had enough RAM, the natural approach would be to read both data files and then merge them by the common ID.
Given the limited RAM, is it possible to pre-process the data file with a shell command that I can pass as the cmd argument of fread? Or does anyone have an alternative solution? Thanks in advance!
Upvotes: 1
Views: 328
Reputation: 39717
You can preprocess your data as described in R Data Import/Export using the GNU text utilities join and sort.
# Create files to use
t1 <- tempfile()  # File 1 with id and data
write.table(data.frame(id=1:5, val=5:1), t1, row.names=FALSE, col.names=FALSE)
t2 <- tempfile()  # File 2 with the ids to keep from File 1
write.table(c(1,3,4), t2, row.names=FALSE, col.names=FALSE)
t3 <- tempfile()  # Sorted copy of File 1
t4 <- tempfile()  # Sorted, deduplicated copy of File 2
# Sort both files on the join field, join them on the id,
# and read only the matching rows back into R
read.table(pipe(paste("sort -k 1b,1", t1, ">", t3, "
sort -u -k 1b,1", t2, ">", t4, "
join", t3, t4)))
# V1 V2
#1 1 5
#2 3 3
#3 4 2
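Since you ask about the cmd argument of fread: an alternative is to let awk do the filtering and have fread read only the matching rows. A minimal sketch, assuming a comma-separated file with the ID in the first column and awk available on the system (bigfile.csv and the column position are placeholders):

library(data.table)

ids <- c(1, 3, 4)  # IDs to keep, e.g. taken from the second dataset
# Build an awk condition like: $1 == 1 || $1 == 3 || $1 == 4
cond <- paste0("$1 == ", ids, collapse = " || ")
# NR==1 keeps the header row; adjust -F and the column number to your file
dt <- fread(cmd = paste0("awk -F',' 'NR==1 || ", cond, "' bigfile.csv"))

For a very large set of IDs this inline condition can exceed the shell's command-line length limit; in that case, write the IDs to a file and match against it in awk instead.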
Upvotes: 3