Reputation: 6766
I have a huge CSV file, around 9 GB in size, and 16 GB of RAM. I followed the advice from that page and implemented it below:
If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field:
--max-vsize=500M
I am still getting the error and warnings below. How should I read this 9 GB file into R? I have 64-bit R 3.3.1 and am running the command below in RStudio 0.99.903, on Windows Server 2012 R2 Standard (64-bit OS).
> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote, ... :
Reached total allocation of 16383Mb: see help(memory.size)
(warnings 2 through 12 are identical to warning 1)
My first try, based on a suggested answer:
> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv", :
Reached total allocation of 16383Mb: see help(memory.size)
My second try, based on another suggested answer:
thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE,
+ first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)
How can I read this file into a single object so that I can analyze the entire data set in one go?
We bought an expensive machine with 10 cores and 256 GB of RAM. That is not the most efficient solution, but it will work at least for the near future. I looked at the answers below, and I don't think they solve my problem :( I do appreciate them, though. I want to perform market basket analysis, and I don't think there is any way around keeping my data in RAM.
Upvotes: 37
Views: 45496
Reputation: 7980
You could try splitting your processing over the table. Instead of operating on the whole thing at once, put the operation inside a for loop and run it over 16, 32, 64, or however many chunks you need. Any values required for later computation can be saved as you go. This isn't as fast as the approaches in the other answers, but it will definitely finish.
CHUNK_SIZE <- 100000                                  # rows per chunk; tune to your RAM
con <- file("C:/Users/a-vs/results_20160291.csv", "r")
x <- ceiling(number_of_rows_in_file / CHUNK_SIZE)     # number of chunks
for (i in seq_len(x)) {
  chunk <- read.csv(con, nrows = CHUNK_SIZE, header = (i == 1))
  # process `chunk` here and keep only the results you need later
}
close(con)
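For example, if the statistic you ultimately need is a column mean, you can keep a running sum and row count per chunk and combine them at the end. This is only a sketch: COL3 is a placeholder column name, and reading the header this way assumes there are no quoted commas in the header row.
con <- file("C:/Users/a-vs/results_20160291.csv", "r")
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])  # header row
total_sum <- 0
total_n   <- 0
repeat {
  lines <- readLines(con, n = CHUNK_SIZE)
  if (length(lines) == 0) break                          # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  total_sum <- total_sum + sum(chunk$COL3)               # this chunk's contribution
  total_n   <- total_n + nrow(chunk)
}
close(con)
total_sum / total_n                                      # overall mean of COL3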
Hope that helps.
Upvotes: 2
Reputation: 3947
You might want to consider leveraging some on-disk processing rather than keeping the entire object in R's memory. One option would be to store the data in a proper database and then have R access that. dplyr is able to deal with a remote source (it actually writes the SQL statements needed to query the database). I've only tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.
First, install SQLite (https://www.sqlite.org/download.html), then create an import script, import.sql:
CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName
Yes, you'll need to specify the column names yourself (I believe) but you can specify their types here too if you wish. This won't work if you have commas anywhere in your names/data, of course.
Then run the import from a command prompt:
sqlite3.exe BIGDATA.sqlite3 < import.sql
Connecting dplyr to the SQLite database: as we're using SQLite, all of the dependencies are handled by dplyr already.
library(dplyr)
my_db <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")
dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception will be that you can't query the number of rows.
my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))
#> Source: query [?? x 2]
#> Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>
#> COL2 meanVal
#> <chr> <dbl>
#> 1 1979 15.26476
#> 2 1980 16.09677
#> 3 1981 15.83936
#> 4 1982 14.47380
#> 5 1983 15.36479
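One consequence of the ?? in the dimensions above is that nrow() cannot tell you how many rows a lazy remote table has. You can, however, ask the database to compute a count, and pull small aggregated results back into R with collect(). A small sketch using the same table:
my_tbl %>% summarise(n = n())   # translated by dplyr to a COUNT(*) query
local_res <- my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3)) %>% collect()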
Upvotes: 18
Reputation: 668
This would be horrible practice, but depending on how you need to process this data, it shouldn't be too bad. You can change the maximum amount of memory that R is allowed to use by calling memory.limit(new), where new is an integer giving R's new memory limit in MB. What will happen is that when you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing).
If you are running this on a server version of Windows, paging will possibly (likely) work differently than on regular Windows 10. I believe it should be faster, as the server OS should be optimized for this sort of thing.
Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)), and if it comes out MUCH larger than that, I would say the program will end up being too slow once the data is loaded into memory. At that point I would recommend buying more RAM or finding a way to process the data in parts.
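A minimal sketch of that approach, assuming the 32 GB starting point suggested above:
memory.limit()                    # current cap in MB, e.g. 16383 on a 16 GB machine
memory.limit(size = 32000)        # raise the cap; Windows will page the excess to disk
answer <- read.csv("C:/Users/a-vs/results_20160291.csv")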
Upvotes: 2
Reputation: 6372
This may not be possible on your computer. In certain cases, a data.table takes up more space than its .csv counterpart.
library(data.table)
DT <- data.table(x = sample(1:2, 10000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)   # ~29 MB file on disk
DT <- fread("test.csv")
object.size(DT)
> 40001072 bytes #40 MB
Two orders of magnitude larger:
DT <- data.table(x = sample(1:2, 1000000000, replace = TRUE))
write.csv(DT, "test.csv", row.names = FALSE)   # ~2.92 GB file on disk
DT <- fread("test.csv")
object.size(DT)
> 4000001072 bytes #4.00 GB
There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor when reading files into memory. However, this varies with the data. For example, using
x = sample(1:10000000, 10000000, replace = T)
gives a factor of roughly 2x (R:csv), while
x = sample(c("foofoofoo", "barbarbar"), 10000000, replace = T)
gives a factor of 0.5x (R:csv).
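To estimate the factor for your particular file, one option is to read a small sample with fread and scale up. This is only a rough sketch: the 100,000-row sample size is arbitrary, and 44,099,243 is the row count reported by fread in the question.
library(data.table)
path <- "C:/Users/a-vs/results_20160291.csv"
sample_dt <- fread(path, nrows = 100000)        # read a small sample of the file
est_in_r  <- as.numeric(object.size(sample_dt)) * (44099243 / 100000)
est_in_r / file.size(path)                      # rough R:csv factor for this data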
Based on the maximum factor, your 9 GB file would take potentially 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating would not work; you would also need to partition your analysis and workflow. Another alternative is to use an on-disk tool such as SQL.
Upvotes: 12
Reputation: 23200
Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
In addition, you can read in the file in chunks:
file_in <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x <- readLines(file_in, n=chunk_size)
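Each subsequent call to readLines on the same open connection continues where the previous one stopped, so a loop along these lines (a sketch; the per-chunk processing is left as a placeholder) walks through the rest of the file:
repeat {
  x <- readLines(file_in, n = chunk_size)
  if (length(x) == 0) break                    # reached the end of the file
  # parse and process the chunk, e.g. read.csv(text = x, header = FALSE),
  # keeping only aggregated results
}
close(file_in)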
You can use data.table to handle reading and manipulating large files more efficiently:
require(data.table)
dt <- fread("in.csv", header = TRUE)
If needed, you can leverage on-disk storage with the ff package:
library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE,
first.rows=10000, next.rows=50000, colClasses=NA)
Upvotes: 32