Reputation: 327

Using fread() instead of read_table() in R for large data sets

I have two different folders, 'big_data' ( one ~ 2GB file) and the other 'small_data' ( ~ 6 files that ~150 MB in total). I want to upload a data file that contains multiple header rows and whitespaces into R. the structure of the file is something like the one shown below

# File name
#
#@   1  "Some text"                                                   "aa"
#@   2  "Some text"                                                   "bb"
#@   3  "Some text"                                                   "cc"
#@   4  "Some text"                                                   "dd"
#@   5  "Some text"                                                   "ee"
#@   6  "Some text"                                                   "ff"

#
#
#
#

 0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  
 1.000000e-03  3.727051e-04  2.532203e-04  4.736003e-04  3.727051e-07  0.000000e+00   
 2.000000e-03  2.266785e-03  1.540081e-03  2.880429e-03  2.639490e-06  0.000000e+00  
 3.000000e-03  7.538553e-03  5.121786e-03  9.579321e-03  1.017804e-05  0.000000e+00   
 4.000000e-03  1.838835e-02  1.249329e-02  2.336627e-02  2.856639e-05  0.000000e+00  
 5.000000e-03  3.703296e-02  2.516073e-02  4.705817e-02  6.559935e-05  0.000000e+00 
 6.000000e-03  2.266785e-03  1.540081e-03  2.880429e-03  2.639490e-06  0.000000e+00  
 7.000000e-03  7.538553e-03  5.121786e-03  9.579321e-03  1.017804e-05  0.000000e+00   
 8.000000e-03  1.838835e-02  1.249329e-02  2.336627e-02  2.856639e-05  0.000000e+00  
 9.000000e-03  3.703296e-02  2.516073e-02  4.705817e-02  6.559935e-05  0.000000e+00

As you can see, it contains 10 rows and 6 columns. The text after '#@' are the column names. I wrote the following function to import the data from the user specified folder. It imports the data, skips the rows that contain header text, extracts the column names and removes NA columns to all files in that folder. I used read_table() if the 'user_specified_folder' is 'small_data' and fread() if the 'user_specified_folder' is 'big_data'. It works for the former, but not the latter. There is no error with fread(), The R session memory was about 6 GB, but the imported file size was only 15 kB with just the column names.

imp_data <- function(user_specified_folder){
  
  # path of a single file for extracting header name
  foo      <- here::here("data",species)
  p_header <- here::here("data",species, list.files(foo)[1])
  
  # find skip value
  
  skip <- tail(grep("\\#", readLines(p_header)),1)
  
  # Get a List of all`.txt` files
  filenames <- list.files(here::here("data",species), pattern = "*.txt", full.names = T)
  

  # Load data sets
  if(user_specified_folder == "big_data"){
    
    list.DFs  <- lapply(filenames, fread , skip = skip)
  }
  
  else{
    
    list.DFs  <- lapply(filenames, read_table, col_names = F, skip = skip)
  }
  
  # name the data frame the same as file name
  names(list.DFs) <- basename(filenames)

  # extract column names from a dataset for a particular type
  rl <- readLines(p_header)
  nms <- rl[grep('^#@', rl)] |>
    sub(pattern='[^"]*"(.*)"\\s*$', replacement='\\1') |> 
    gsub(pattern='[" ]+', replacement=' ')
  
  # remove NA columns
  list.DFs <- lapply(list.DFs, function(x) x[, colSums(is.na(x)) < nrow(x)])
  
  # Apply column names to all datasets of a particular user_specified_folder
  list.DFs <- lapply(list.DFs, setNames, nms)
  
  return(list.DFs)
}

data_big_data <- imp_data("big_data")

How do I get the above code working with fread()?

Upvotes: 0

Answers (2)

Wimpel

Reputation: 27732

I suggest using the cmd-argument from fread(), to pre-process the file before reading.

Functionally, it reads the output of a shell command using fread. In this case, we use a grep-like command (findstr is present in all modern windows (i'm using W10) versions) to read only the lines we want to read.

On windows it works with findstr, on a *nix environment, you have to adapt the code below to use of grep or awk.

Possible downside of this solution: Depending on the shell-command chosen, it might make your code less portable between operating systems.

library(data.table)

# extract the forth column of all lines starting with "#@"
colnames <- fread(cmd = 'findstr "^#@" weird_file.csv', 
                sep = " ", header = FALSE, select = 4)

# extract all lines not startting with a "#" 
mydata <- fread(cmd = 'findstr "^[^#]" weird_file.csv', 
                sep = " ", header = FALSE, col.names = unlist(colnames) )


#       aa           bb           cc           dd           ee ff
# 1: 0.000 0.0000000000 0.0000000000 0.0000000000 0.000000e+00  0
# 2: 0.001 0.0003727051 0.0002532203 0.0004736003 3.727051e-07  0
# 3: 0.002 0.0022667850 0.0015400810 0.0028804290 2.639490e-06  0
# 4: 0.003 0.0075385530 0.0051217860 0.0095793210 1.017804e-05  0
# 5: 0.004 0.0183883500 0.0124932900 0.0233662700 2.856639e-05  0
# 6: 0.005 0.0370329600 0.0251607300 0.0470581700 6.559935e-05  0
# 7: 0.006 0.0022667850 0.0015400810 0.0028804290 2.639490e-06  0
# 8: 0.007 0.0075385530 0.0051217860 0.0095793210 1.017804e-05  0
# 9: 0.008 0.0183883500 0.0124932900 0.0233662700 2.856639e-05  0
#10: 0.009 0.0370329600 0.0251607300 0.0470581700 6.559935e-05  0

sample data used

Upvotes: 2

Roland

Reputation: 132706

This is how I would read these files:

library(data.table)
filename <- 'E:/temp/test.txt'
con <- file(filename)

n <- 0
repeat {
  n <- n + 10
  header <- readLines(con, n)
  test1 <- grepl("#", header, fixed = TRUE)
  test2 <- nchar(header) == 0 #assumes empty lines don't contain whitespace characters
  if (any(!test1 & !test2)) {
    n <- n - sum(!test1 & !test2)
    break
  }
}

close(con)

header <- header[seq_len(n)]
header <- header[nchar(header) > 1]

header <- sub(pattern='[^"]*"[^"]*"[^"]*"', replacement='', header)
header <- gsub('"', '', header)

DT <- fread(file = filename, skip = n, header = FALSE)
setnames(DT, header)

print(DT)
#       aa           bb           cc           dd           ee ff
# 1: 0.000 0.0000000000 0.0000000000 0.0000000000 0.000000e+00  0
# 2: 0.001 0.0003727051 0.0002532203 0.0004736003 3.727051e-07  0
# 3: 0.002 0.0022667850 0.0015400810 0.0028804290 2.639490e-06  0
# 4: 0.003 0.0075385530 0.0051217860 0.0095793210 1.017804e-05  0
# 5: 0.004 0.0183883500 0.0124932900 0.0233662700 2.856639e-05  0
# 6: 0.005 0.0370329600 0.0251607300 0.0470581700 6.559935e-05  0
# 7: 0.006 0.0022667850 0.0015400810 0.0028804290 2.639490e-06  0
# 8: 0.007 0.0075385530 0.0051217860 0.0095793210 1.017804e-05  0
# 9: 0.008 0.0183883500 0.0124932900 0.0233662700 2.856639e-05  0
#10: 0.009 0.0370329600 0.0251607300 0.0470581700 6.559935e-05  0

Upvotes: 2

Using fread() instead of read_table() in R for large data sets

Answers (2)

Related Questions