Reputation: 1116
I have a dataset x with 350m rows and 4 columns. When joining two columns from a dataset i of 13m rows and 19 columns, I encounter the following error:
Internal logical error. DT passed to assign has not been allocated enough column slots. l=4, tl=4, adding 1
I have checked "Not Enough Columns Slots", but there the problem appears to be the number of columns. Since I have only a few, I would be surprised if this were the issue.
Also, I found https://github.com/Rdatatable/data.table/issues/1830, where the error is related to "column slots", but I do not understand what they are. When checking truelength, I obtain
> truelength(x)
[1] 0
> truelength(i)
[1] 0
My understanding is that setting, for example, alloc.col(x,32) or alloc.col(i,32), or both, could solve the issue. However, I don't understand what this does and what the issue is. Can anyone offer an explanation?
Upvotes: 2
Views: 738
Reputation: 34703
Part of what makes data.table so efficient is that it tries to be smart about memory usage (whereas base data.frames tend to end up getting copied left and right in regular usage; e.g., setting names(DF) = col_names can actually copy all of DF despite only manipulating an attribute of the object).
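As a rough illustration of the by-reference idea, the sketch below renames columns once with base R's names<- and once with data.table's setnames(), and records the object address (via data.table's address()) before and after. The objects DF and DT and their column names are made up for the example. Whether the data.frame's address changes depends on the R version, so nothing is claimed about that part.

```r
library(data.table)

# Toy objects; names and sizes are illustrative only
DF <- data.frame(a = 1:3, b = 4:6)
DT <- data.table(a = 1:3, b = 4:6)

# Base R: replacing the names attribute may copy (how much depends on the R version)
df_before <- address(DF)
names(DF) <- c("x", "y")
df_after <- address(DF)

# data.table: setnames() updates the names by reference, so DT itself is not copied
dt_before <- address(DT)
setnames(DT, c("x", "y"))
dt_after <- address(DT)

identical(dt_before, dt_after)  # the data.table stays at the same address
```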
Part of this, in turn, is that a data.table is always allocated a certain size in memory to allow for adding/subtracting column pointers more fluidly (from a memory perspective).
So, while actual columns take memory greedily (when they're created, sufficient memory is claimed to store the nrow(DT)-size vector), the column pointers, which store the addresses where the actual data can be found (you can think of these as roughly like column names, if you don't know the grittier details of pointers), have a fixed number of memory slots reserved upon creation.
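To make "column slots" concrete, here is a minimal sketch comparing length() (slots in use) with truelength() (slots allocated) on a small, made-up data.table. The exact number of allocated slots depends on the datatable.alloccol option, so no specific value is assumed here.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

used <- length(DT)           # column pointer slots in use (number of columns)
allocated <- truelength(DT)  # column pointer slots allocated, including spare ones

# Adding a column by reference fills one of the spare slots
DT[, c := 7L]

used_after <- length(DT)
allocated_after <- truelength(DT)

cat("used:", used, "->", used_after, "\n")
cat("allocated:", allocated, "->", allocated_after, "\n")
```

The number of used slots grows by one while the allocation stays the same, which is why adding a column this way does not need to copy the table.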
alloc.col forces the column pointer address reserve process; this is most commonly needed in two cases:
1. Your data needs more columns than there are column pointer slots allocated.
2. Your data was loaded from disk (readRDS/load don't know to allocate this memory for a data.table upon loading, so we have to trigger this ourselves).
I assume Frank is right and that you're experiencing the latter. See ?alloc.col for some more details, but in most cases you should just run alloc.col(x) and alloc.col(i) -- except on highly constrained machines, allocating 1024 column pointers requires relatively little memory, so you shouldn't spend too much effort skimping and trying to figure out the right quantity.
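The readRDS case can be sketched as follows, assuming a small made-up table and a temporary file path. It only shows how the allocation changes around alloc.col; whether the original error is reproduced will depend on the data.table version.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# Round-trip through an RDS file (temporary path, for illustration only)
path <- tempfile(fileext = ".rds")
saveRDS(DT, path)
DT2 <- readRDS(path)

slots_loaded <- truelength(DT2)  # spare slots are not restored by readRDS

# Re-allocate column pointer slots (default count comes from getOption("datatable.alloccol"))
DT2 <- alloc.col(DT2)
slots_realloc <- truelength(DT2)

# Adding a column by reference now has a spare slot to use
DT2[, c := 7L]

cat("after readRDS:", slots_loaded, " after alloc.col:", slots_realloc, "\n")
unlink(path)
```

In the question's setting, the same call would be applied to x and i before the join.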
Upvotes: 4