bumblebee

Reputation: 1116

column slots in data.table

I have a dataset x with 350m rows and 4 columns. When joining two columns from a dataset i of 13m rows and 19 columns, I encounter the following error:

Internal logical error. DT passed to assign has not been allocated enough column slots. l=4, tl=4, adding 1

I have checked the question "Not Enough Columns Slots", but there the problem appears to be the number of columns. Since I have only a few, I would be surprised if this was the issue.

Also, I found https://github.com/Rdatatable/data.table/issues/1830, where the error is related to "column slots", but I do not understand what they are. When checking truelength, I obtain

> truelength(x)
[1] 0
> truelength(i)
[1] 0

My understanding is that calling, for example, alloc.col(x,32) or alloc.col(i,32), or both, could solve the issue. However, I don't understand what this does or what the underlying issue is. Can anyone offer an explanation?

Upvotes: 2

Views: 738

Answers (1)

MichaelChirico

Reputation: 34703

Part of what makes data.table so efficient is that it tries to be smart about memory usage (whereas base data.frames tend to end up getting copied left and right in regular usage, e.g., setting names(DF) = col_names can actually copy all of DF despite only manipulating an attribute of the object).

Part of this, in turn, is that a data.table is always allocated a certain size in memory to allow for adding/subtracting column pointers more fluidly (from a memory perspective).

So, while actual columns take memory greedily (when they're created, sufficient memory is claimed to store the nrow(DT)-size vector), the column pointers, which store the addresses where the actual data can be found (you can think of them as loosely like column names, if you don't know the grittier details of pointers), get a fixed number of memory slots upon creation.
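To make the slot idea concrete, here is a minimal sketch, assuming a recent data.table with default options. The table DT and its columns are made up for illustration. It compares the number of columns in use (length) with the number of reserved pointer slots (truelength), and checks that adding a column by reference does not move the table in memory.

```r
library(data.table)

# Hypothetical toy table, only for illustration
DT <- data.table(a = 1:3)

used_slots     <- length(DT)      # columns currently in use
reserved_slots <- truelength(DT)  # column pointer slots reserved at creation

before <- address(DT)             # memory address of the table itself
DT[, b := a * 2L]                 # add a column by reference, filling a spare slot
after <- address(DT)

# With spare slots available, the table should not have been reallocated
identical(before, after)
```

Because spare slots were reserved up front, := only has to fill in one more pointer rather than rebuild the whole column list.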

alloc.col forces this reservation of column pointer slots to happen; it is most commonly needed in two cases:

  1. Your data needs a lot of columns (by default, room is allocated for 1024 pointers more than there are columns at definition)
  2. You've loaded your data from RDS (since readRDS/load don't know to allocate this memory for a data.table upon loading, we have to trigger this ourselves)

I assume Frank is right and that you're experiencing the latter. See ?alloc.col for some more details, but in most cases, you should just run alloc.col(x) and alloc.col(i). Except on highly constrained machines, allocating 1024 column pointers requires relatively little memory, so you shouldn't spend too much effort skimping and trying to figure out the right quantity.
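The RDS case can be sketched as follows. This is an illustrative round trip under assumed names (DT, DT2, and a temporary file), not your actual data. Newer data.table versions may repair the missing allocation automatically in some code paths, so treat it as a demonstration of the mechanism rather than a guaranteed reproduction of the error.

```r
library(data.table)

# Hypothetical small table standing in for x or i
DT <- data.table(id = 1:5, value = letters[1:5])

# Round-trip through RDS; readRDS does not know about data.table's over-allocation
tmp <- tempfile(fileext = ".rds")
saveRDS(DT, tmp)
DT2 <- readRDS(tmp)

tl_loaded <- truelength(DT2)   # typically reports 0 right after loading

# Reserve column pointer slots again (default from getOption("datatable.alloccol"))
alloc.col(DT2)
tl_fixed <- truelength(DT2)

# Adding a column by reference now has a slot to go into
DT2[, flag := TRUE]
```

In recent data.table releases the same function is also available under the name setalloccol.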

Upvotes: 4
