kkerr

Reputation: 31

How to read a very large text file (~15GB)?

I have a very large .txt file which I need to load into RStudio. The file contains ~850,000 columns and ~2000 rows, which is too large to open in Excel to convert to .csv. A description of the .txt file contents:

SampleID,CpGprobeID1,CpGprobeID2(x850K)...
ID1,betavalue1(x850k)
ID2,betavalue1(x850K)...

Of this .txt file, I actually only need ~30000 of the columns (CpGprobeIDs). I have a .csv file containing those ~30000 required CpGprobeIDs, listed in rows rather than columns (call this file CpGprobeIDsREDUCED.csv).

Is there a way to reduce the columns in the 15 GB .txt file to only those IDs listed in the ~30000 rows of CpGprobeIDsREDUCED.csv, or is there another way to load such a large file into RStudio?

So far I have unsuccessfully tried:

library(data.table)
filename <- fread("filename.txt") 

^ R encounters a fatal error and the session aborts. This does work for a smaller sample of ~24 rows, even with the 850K columns, so this is the closest I have got.
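If it helps, data.table::fread also has a select argument that reads only the named columns; below is a minimal sketch of the kind of call I have in mind (assuming the header row of filename.txt contains the probe IDs, and reusing the "Name" column of the reduced file):

library(data.table)
library(readr)

# the ~30000 probe IDs I want to keep (from the "Name" column of the reduced file)
CpGprobeIDsREDUCED <- read_csv("CpGprobeIDsREDUCED.csv")
CpGkeep <- CpGprobeIDsREDUCED$Name

# read only SampleID plus the wanted probe columns; select takes column names or indices
subset_dt <- fread("filename.txt", select = c("SampleID", CpGkeep))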

So then I tried converting it to a csv file containing only the ~30000 required IDs and encountered the following error message:

library(readr)
CpGprobeIDsREDUCED <- read_csv("CpGprobeIDsREDUCED.csv")
library(sqldf)
CpGkeep <- CpGprobeIDsREDUCED$Name
import_filename <- fn$read.csv.sql("filename.txt", sep = "\t",
  sql = "select * from file where SampleID in (`toString(CpGkeep)`)")

Error: too many columns on file
In addition: Warning message:
In for (i in seq_along(col)) col[i] <- length(scan(file, what = "", :
closing unused connection 3 (filename.txt)

I'm at a bit of a loss as to what to do next; any advice is appreciated.

Thanks

Upvotes: 3

Views: 3584

Answers (1)

mhovd

Reputation: 4087

This is where a package such as disk.frame may be used, though there are a few other options out there.

Set up your frame with

library(disk.frame)

setup_disk.frame()                     # start disk.frame's parallel background workers
options(future.globals.maxSize = Inf)  # remove the size limit on data exported to those workers

Then load your data with

df_path = file.path(tempdir(), "tmp.df")

data <- csv_to_disk.frame(
  "filename.txt",
  outdir = df_path,
  overwrite = TRUE)
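Since you only need ~30000 of the columns, you can then restrict the disk.frame to those before pulling it into memory. A rough sketch, assuming CpGkeep is the character vector of probe IDs from CpGprobeIDsREDUCED.csv and that disk.frame's srckeep()/collect() are used as documented:

library(disk.frame)
library(dplyr)

# read only the wanted columns from disk, then collect the much smaller result into memory
reduced <- data %>%
  srckeep(c("SampleID", CpGkeep)) %>%
  collect()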

Also check out this example from their vignette:

If the CSV is too large to read in, then we can also use the in_chunk_size option to control how many rows to read in at once. For example, to read in the data 100,000 rows at a time:

library(nycflights13)
library(disk.frame)

# write a csv
csv_path = file.path(tempdir(), "tmp_flights.csv")

data.table::fwrite(flights, csv_path)

df_path = file.path(tempdir(), "tmp_flights.df")

flights.df <- csv_to_disk.frame(
  csv_path, 
  outdir = df_path, 
  in_chunk_size = 100000)
  
flights.df
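Applied to the 15 GB file from the question, the same pattern would look roughly like this (a sketch; with only ~2000 very wide rows, a small in_chunk_size keeps each chunk within memory, and the exact value is something to tune):

data <- csv_to_disk.frame(
  "filename.txt",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 100,   # rows per chunk; lower this if memory is tight
  overwrite = TRUE)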

Upvotes: 2
