Reputation: 1953
I've been trying to learn about the different data structures used in popular languages that I have experience with such as lists and dictionaries in Python, associative arrays in PHP (essentially hash tables), vectors in C++, etc.
I have a lot of colleagues that use R religiously and I was wondering how vectors, matrices, and data frames are implemented in R. What are their strengths and weaknesses? I was looking through source code but I couldn't find the data structures themselves. Where in the source code are these definitions located?
Upvotes: 9
Views: 1865
Reputation: 363
A little late, but wanted to point out a mistake with one of the other answers and to give an explicit answer. Look at internals manual:
https://cran.r-project.org/doc/manuals/R-ints.html#The-_0027data_0027
Read the beginning of this section, and the entry for 'INTSXP'. It seems integer vectors are implemented as an array of C int. Similarly for 'REALSXP' and 'CHARSXP'.
Implementing it as linked lists would've been prohibitively slow.
Upvotes: 0
Reputation: 1953
From R Internals, 1.1 SEXPs:
... the basic building blocks of R objects are often called nodes... Both types of node structure have as their first three fields a 32-bit spxinfo header and then three pointers (to the attributes and the previous and next node in a doubly-linked list)
So vectors in R are implemented as a doubly-linked list. And, it even appears that there is no data structure smaller than a single-node linked list. This is evident by:
> a <- 4
> a[1]
4
As mentioned by others: builtin.c
has do_makevector
and do_makelist
, and array.c
has the source for do_matrix
. In addition array.c
contains source for allocMatrix
and memory.c
contains the source for allocVector
.
While a lot of things that were going on were over my head, it seems evident that a matrix is simply a doubly-linked list of doubly-linked lists. I believe (though am unsure) that row and column names (like those stored in a data frame) are stored in the 'attributes' of each list.
The response to the "what the strengths and weaknesses" of the implementation of the data structures would be that (from my limited knowledge) doubly linked lists have a strength in that dynamic memory allocation is simpler and doesn't require the overhead of copying and reallocating an entire array, and the weakness being that (depending on how many pointers there are to the list: head, tail, middle, quarters, etc.) accessing a random value v[99]
can take the overhead of iterating through several elements before the desired one is found.
Is this correct?
Upvotes: 1
Reputation: 3963
As already mentioned, check out the "R internals" manual, as well as this part of "Writing R extensions".
Upvotes: 4