Reputation: 179
I have read many questions here on memory management, so I cleaned up my GB-sized data, narrowed it down to 32 MB (770K rows), and now store it on BigQuery. But when I try to turn it into a matrix with
as.data.frame(str_split_fixed(event_list$event_list, ",", max(length(strsplit(event_list$event_list, ",")))))
I get this error:
Error: cannot allocate vector of size 4472.6 Gb
Is there any way to fix this problem? What am I doing wrong here? Is it storing the data on BigQuery or converting it to a matrix that increases the size?
Upvotes: 1
Views: 91
Reputation: 5283
@JosephWood nailed it. If event_list has 700,000 rows, then you're trying to create a data.frame with 700,000 rows and 700,000 columns. strsplit(event_list$event_list, ",") is a list of length 700,000, so length(strsplit(event_list$event_list, ",")) gives a single number, 700000, and max of one number is just that number. You should use lengths instead.
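To see the difference on a small made-up vector (x here is purely hypothetical, standing in for event_list$event_list):
x <- c("a,b,c", "d,e", "f")   # toy stand-in for event_list$event_list
splits <- strsplit(x, ",")
length(splits)                # 3: the number of strings, not what n should be
lengths(splits)               # 3 2 1: how many values each string splits into
max(lengths(splits))          # 3: the width the result actually needs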
So your call to str_split_fixed ends up acting like this:
str_split_fixed(event_list$event_list, ",", n = 700000)
That gives a character matrix with 700,000 rows (the length of event_list$event_list) and 700,000 columns (n).
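On the same toy vector you can see how n controls the width: str_split_fixed pads every row out to n columns with empty strings, so an oversized n blows up the result.
x <- c("a,b,c", "d,e", "f")      # same hypothetical toy vector as above
library(stringr)
str_split_fixed(x, ",", n = 3)   # 3 x 3 character matrix, short rows padded with ""
str_split_fixed(x, ",", n = 10)  # still 3 rows, but 10 columns, 7 of them entirely blank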
On my machine, I roughly estimated the necessary memory:
format(700000 * object.size(character(700000)), "GB")
# [1] "3650.8 Gb"
That's not counting any extra memory required to store those vectors in a data.frame.
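For intuition, that estimate is essentially just the pointer cost of the cells: on a 64-bit build each element of a character vector is an 8-byte pointer (here all pointing at the same shared empty string), so a rough hand calculation lands in the same place.
700000 * 700000 * 8 / 1024^3   # GiB needed just for 700,000^2 string pointers
# roughly 3650.8, matching the estimate above
The 4472.6 Gb in the error message is the same arithmetic with the question's actual row count, which is a bit above 770,000.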
The solution:
split_values <- strsplit(event_list$event_list, ",")                 # one character vector per row
value_counts <- lengths(split_values)                                 # number of values in each row
extra_blanks <- lapply(max(value_counts) - value_counts, character)   # blank padding needed per row
values_with_blanks <- mapply(split_values, extra_blanks, FUN = c, SIMPLIFY = FALSE)  # pad each row to the same length
DF <- as.data.frame(values_with_blanks)                               # equal-length vectors -> data.frame
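As a quick sanity check, here is the same recipe on a tiny made-up event_list (hypothetical data, not the BigQuery table):
event_list <- data.frame(event_list = c("a,b,c", "d,e"), stringsAsFactors = FALSE)  # toy input
split_values <- strsplit(event_list$event_list, ",")
value_counts <- lengths(split_values)                # 3 2
extra_blanks <- lapply(max(value_counts) - value_counts, character)
values_with_blanks <- mapply(split_values, extra_blanks, FUN = c, SIMPLIFY = FALSE)
DF <- as.data.frame(values_with_blanks)
dim(DF)                                              # 3 rows, 2 columns: one column per original string
Note the layout: each original string becomes a column of DF, padded with "" up to the longest length. If you want one row per string instead (the str_split_fixed orientation), t(as.matrix(DF)) would give that as a character matrix.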
Upvotes: 2