Efe

Reputation: 179

Workaround for memory allocation

I have read many questions here on memory management. I cleaned up my GB-sized data, narrowed it down to 32 MB (770K rows), and now store it on BigQuery. But when I try to turn it into a matrix with as.data.frame(str_split_fixed(event_list$event_list, ",", max(length(strsplit(event_list$event_list, ","))))), I get this error:

Error: cannot allocate vector of size 4472.6 Gb

Is there any way to fix this problem? What am I doing wrong here? Does storing it on BigQuery, or converting it to a matrix, increase the size?

Upvotes: 1

Views: 91

Answers (1)

Nathan Werth

Reputation: 5283

@JosephWood nailed it. If event_list has 700,000 rows, then you're trying to create a data.frame with 700,000 rows and 700,000 columns. strsplit(event_list$event_list, ",") would be a list of length 700,000, so length(strsplit(event_list$event_list, ",")) gives a single number: 700000. max of one number is just that number. You should use lengths instead.
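The length/lengths mix-up is easy to see on a toy vector (events below is a made-up stand-in for event_list$event_list):

```r
# Made-up stand-in for event_list$event_list: four rows, up to three values each
events <- c("a,b", "a,b,c", "a", "b,c")
parts  <- strsplit(events, ",")

length(parts)        # 4: the number of rows, a single number
max(length(parts))   # still 4: max() of one number is just that number
lengths(parts)       # 2 3 1 2: the number of values in each row
max(lengths(parts))  # 3: the column count you actually want
```

With the real data, the difference is 700,000 columns versus the handful you intended.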

So your call to str_split_fixed ends up acting like this:

str_split_fixed(event_list$event_list, ",", n = 700000)

That gives a character matrix with 700,000 rows (the length of event_list$event_list) and 700,000 columns (n), each short row padded out with empty strings.

On my machine, I roughly estimated the necessary memory:

format(700000 * object.size(character(700000)), "GB")
# [1] "3650.8 Gb"

That's not counting any extra memory required to store those vectors in a data.frame.

The solution:

split_values <- strsplit(event_list$event_list, ",")
value_counts <- lengths(split_values)
extra_blanks <- lapply(max(value_counts) - value_counts, character)
values_with_blanks <- mapply(split_values, extra_blanks, FUN = c, SIMPLIFY = FALSE)
DF <- as.data.frame(values_with_blanks)
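On a small made-up event_list, the whole pipeline looks like this (each original row becomes one column of the result, padded with blanks; the column names are auto-generated by as.data.frame):

```r
# Made-up three-row stand-in for the BigQuery table
event_list <- data.frame(event_list = c("a,b,c", "d,e", "f"),
                         stringsAsFactors = FALSE)

split_values <- strsplit(event_list$event_list, ",")
value_counts <- lengths(split_values)                    # 3 2 1 values per row
extra_blanks <- lapply(max(value_counts) - value_counts, character)
values_with_blanks <- mapply(split_values, extra_blanks, FUN = c, SIMPLIFY = FALSE)
DF <- as.data.frame(values_with_blanks, stringsAsFactors = FALSE)

dim(DF)  # 3 3: three columns (one per original row), each padded to length 3
```

Here the longest row has 3 values, so every column is padded to length 3 instead of 700,000.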

Upvotes: 2
