Reputation: 101
I'm encountering problems while trying to read a large .txt file (7.7 GB) into R. The file contains 817,426 columns and more than 1,000 rows, and all variables are numeric. So far I have tried several packages (data.table, vroom, bigreadr) with the commands fread, vroom, and big_fread2.
With fread, I was able to read the first 145 rows into my R session, but it crashes as soon as I try to read 146 or more. With the other commands, the session simply aborts after some time with the error message:
R session aborted. R encountered a fatal error. The session was terminated
This is the code I have used so far:
library(data.table)
library(vroom)
library(bigreadr)
system.time(dfUga <- fread("CpG_sexageres.txt", nrows = 145, header = TRUE, sep = "\t", colClasses = "numeric"))
system.time(dfUga <- vroom("CpG_sexageres.txt", col_names = TRUE))
system.time(dfUga <- big_fread2("CpG_sexageres.txt"))
Any suggestions are highly appreciated. Cheers
Upvotes: 3
Views: 1274
Reputation: 520918
R operates almost entirely in memory. This means that if the data frame resulting from reading the file would exceed available RAM, the attempt to read it will crash R. One option here is to use a tool better suited to hosting such a large data set: a database. You could load your data into a database, and then access it from R using an appropriate package.
If you do decide that you really need to work with the entire set, then most relational databases can probably be made to work here. MySQL is one example, and the RMySQL package can interface with a MySQL database from R.
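As a rough illustration, here is a minimal sketch of querying such a table from R via DBI and RMySQL. It assumes the file has already been bulk-loaded into a MySQL table; the table name "cpg" and the connection details are hypothetical placeholders:

library(DBI)
library(RMySQL)

# Connection details are placeholders; substitute your own.
con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 user = "me", password = "secret")

# Pull only the rows (or columns) needed for the current analysis,
# rather than the whole 7.7 GB table.
subset_df <- dbGetQuery(con, "SELECT * FROM cpg LIMIT 100")

dbDisconnect(con)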
However, you might not even need to use the entire dataset all at once. If you are planning to do statistical calculations on your data, and the rows are evenly or randomly distributed with respect to line number in the file, you may be able to read just a subset into R. For example, if you were to read only every 10th row of your 7.7 GB file, you would end up with a roughly 770 MB data frame, which should be well within the memory limits of your R installation. Here is one way to sample every Nth row from an input file (see the sketch below).
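As a concrete sketch of that approach, the following uses fread()'s cmd argument to pipe the file through awk, keeping the header line plus every 10th data row. It assumes a Unix-like system with awk available; the sampling interval and file name are just the ones from the question:

library(data.table)

# NR == 1 keeps the header; (NR - 2) % 10 == 0 keeps data rows
# 2, 12, 22, ... i.e. every 10th row of data.
dfSample <- fread(cmd = "awk 'NR == 1 || (NR - 2) % 10 == 0' CpG_sexageres.txt")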
Upvotes: 5