Reputation: 31
I have a very large .txt file which I need to load into RStudio. The file contains ~850,000 columns and ~2,000 rows, which is far too large to open in Excel to convert to .csv. A description of the .txt file contents:
SampleID,CpGprobeID1,CpGprobeID2(x850K)...
ID1,betavalue1(x850k)
ID2,betavalue1(x850K)...
Of this .txt file, I actually only need ~30,000 of the columns (CpGprobeIDs). I have a .csv file containing those ~30,000 required CpGprobeIDs, listed in rows rather than columns (call this file CpGprobeIDsREDUCED.csv).
Is there a way to reduce the 15GB .txt file to only the columns whose IDs appear in the ~30,000 rows of CpGprobeIDsREDUCED.csv, or another way to load such a large file into RStudio?
So far I have unsuccessfully tried:
library(data.table)
filename <- fread("filename.txt")
^ R encounters a fatal error and the session aborts. This actually does work for a smaller sample size of ~24 rows, even with the 850K columns, so this is the closest I got.
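I was also wondering whether fread's select argument could do the column reduction directly. Something like this (untested, and it assumes the probe IDs in the Name column of CpGprobeIDsREDUCED.csv exactly match the column names in filename.txt):
library(data.table)
# the ~30,000 probe IDs I want to keep
CpGkeep <- fread("CpGprobeIDsREDUCED.csv")$Name
# only parse SampleID plus the wanted probe columns
reduced <- fread("filename.txt", select = c("SampleID", CpGkeep))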
So then I tried converting it to a .csv file containing only the ~30,000 required IDs and encountered the following error message:
library(readr)
CpGprobeIDsREDUCED <- read_csv("CpGprobeIDsREDUCED.csv")
library(sqldf)
CpGkeep <- CpGprobeIDsREDUCED$Name
import_filename <- fn$read.csv.sql("filename.txt", sep = "\t",
  sql = "select * from file where SampleID in (`toString(CpGkeep)`)")
Error: too many columns on file
In addition: Warning message:
In for (i in seq_along(col)) col[i] <- length(scan(file, what = "", :
closing unused connection 3 (filename.txt)
I'm at a bit of a loss as to what to do next; any advice is appreciated.
Thanks
Upvotes: 3
Views: 3584
Reputation: 4087
This is where a package like disk.frame can be used, but there are a few other options out there.
Set up disk.frame with
library(disk.frame)
# start background workers and allow large objects to be passed to them
setup_disk.frame()
options(future.globals.maxSize = Inf)
Then load your data with
df_path = file.path(tempdir(), "tmp.df")
data <- csv_to_disk.frame(
  "filename.txt",
  outdir = df_path,
  overwrite = TRUE)
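Once the data is a disk.frame, you can restrict it to the columns you actually need before pulling anything into memory. A rough sketch (assuming the probe IDs in the Name column of CpGprobeIDsREDUCED.csv exactly match the column names in filename.txt):
library(dplyr)
# the ~30,000 probe IDs to keep
CpGkeep <- read.csv("CpGprobeIDsREDUCED.csv")$Name
# srckeep() limits which columns disk.frame reads from disk,
# so only SampleID plus the ~30k probe columns ever reach memory
reduced <- data %>%
  srckeep(c("SampleID", CpGkeep)) %>%
  collect()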
Also check out this example from their vignette:
If the CSV is too large to read in, then we can also use the in_chunk_size option to control how many rows to read in at once. For example, to read in the data 100,000 rows at a time:
library(nycflights13)
library(disk.frame)
# write a csv
csv_path = file.path(tempdir(), "tmp_flights.csv")
data.table::fwrite(flights, csv_path)
df_path = file.path(tempdir(), "tmp_flights.df")
flights.df <- csv_to_disk.frame(
  csv_path,
  outdir = df_path,
  in_chunk_size = 100000)
flights.df
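The same pattern should apply to your file; with ~850K columns each row is very wide, so a much smaller in_chunk_size is probably sensible (the 200 below is just a guess, tune it to your available RAM):
data <- csv_to_disk.frame(
  "filename.txt",
  outdir = file.path(tempdir(), "tmp.df"),
  in_chunk_size = 200,   # roughly 200 of the ~2,000 rows per chunk
  overwrite = TRUE)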
Upvotes: 2