Parseltongue

Reputation: 11657

R: Simple Random Sample of Massive Dataframe

I have a massive (8GB) dataset, which I am simply unable to read into R with my existing setup. Attempting to use fread on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing the total number of rows in the dataset; (2) my method was not a true "random sampling."

These attempts to get the number of rows have failed (they take as long as simply reading the data in):

  1. length(count.fields("file.dat", sep = "|"))
  2. read.csv.sql("file.dat", header = FALSE, sep = "|", sql = "select count(*) from file")

Is there any way via R or some other program to generate a random sample from a large underlying dataset?

Potential idea: Is it possible, given a "sample" of the first several rows, to get a sense of the average amount of information contained on a per-row basis, and then back out how many rows there must be given the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ballpark figure that I could just undercut.
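A minimal sketch of that estimate, assuming file.dat is plain text with one record per line (the 1000-line window is arbitrary):

first.lines = readLines("file.dat", n = 1000)             # peek at the first 1000 rows
avg.bytes = mean(nchar(first.lines, type = "bytes") + 1)  # +1 byte for each newline
est.rows = floor(file.size("file.dat") / avg.bytes)       # rough total row count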

Upvotes: 3

Views: 1045

Answers (1)

dww

Reputation: 31452

Here's one option, using the ability of fread to accept a shell command that preprocesses the file as its input. Using this option, we can run a gawk script to extract the required lines. Note that you may need to install gawk if it is not already on your system; if you have awk instead, you can use that.

First, let's create a dummy file to test on:

library(data.table)
dt = data.table(1:1e6, sample(letters, 1e6, replace = TRUE))
write.csv(dt, 'test.csv', row.names = FALSE)

Now we can use the shell command wc to find how many lines there are in the file:

nl = read.table(pipe("wc -l test.csv"))[[1]]

Take a sample of line numbers and write them (in ascending order) to a temp file, which makes them easily accessible to gawk.

N = 20 # number of lines to sample
sample.lines = sort(sample(2:nl, N)) #start sample at line 2 to exclude header 
cat(paste0(sample.lines, collapse = '\n'), file = "lines.txt")

Now we are ready to read in the sample using fread and gawk (based on this answer). The gawk script first reads lines.txt (while NR == FNR) and stores each sampled line number as a key of the array nums; it then prints every line of test.csv whose line number (FNR) is in nums. You can also try some of the other gawk scripts in the linked question, which may be more efficient on very large data.

dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv")
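One caveat: because line 1 (the header) is excluded from the sampled line numbers, fread will typically assign default column names (V1, V2, ...). A small sketch to restore the original names by reading just the header row:

setnames(dt.sample, names(fread("test.csv", nrows = 0)))  # nrows = 0 reads only the header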

Upvotes: 3
