grasingerm
grasingerm

Reputation: 1953

how are vectors, matrices, and data frames implemented in R?

I've been trying to learn about the different data structures used in popular languages that I have experience with such as lists and dictionaries in Python, associative arrays in PHP (essentially hash tables), vectors in C++, etc.

I have a lot of colleagues that use R religiously and I was wondering how vectors, matrices, and data frames are implemented in R. What are their strengths and weaknesses? I was looking through source code but I couldn't find the data structures themselves. Where in the source code are these definitions located?

Upvotes: 9

Views: 1865

Answers (3)

akshan
akshan

Reputation: 363

A little late, but wanted to point out a mistake with one of the other answers and to give an explicit answer. Look at internals manual:

https://cran.r-project.org/doc/manuals/R-ints.html#The-_0027data_0027

Read the beginning of this section, and the entry for 'INTSXP'. It seems integer vectors are implemented as an array of C int. Similarly for 'REALSXP' and 'CHARSXP'.

Implementing it as linked lists would've been prohibitively slow.

Upvotes: 0

grasingerm
grasingerm

Reputation: 1953

From R Internals, 1.1 SEXPs:

... the basic building blocks of R objects are often called nodes... Both types of node structure have as their first three fields a 32-bit spxinfo header and then three pointers (to the attributes and the previous and next node in a doubly-linked list)

So vectors in R are implemented as a doubly-linked list. And, it even appears that there is no data structure smaller than a single-node linked list. This is evident by:

> a <- 4
> a[1]
4

As mentioned by others: builtin.c has do_makevector and do_makelist, and array.c has the source for do_matrix. In addition array.c contains source for allocMatrix and memory.c contains the source for allocVector.

While a lot of things that were going on were over my head, it seems evident that a matrix is simply a doubly-linked list of doubly-linked lists. I believe (though am unsure) that row and column names (like those stored in a data frame) are stored in the 'attributes' of each list.

The response to the "what the strengths and weaknesses" of the implementation of the data structures would be that (from my limited knowledge) doubly linked lists have a strength in that dynamic memory allocation is simpler and doesn't require the overhead of copying and reallocating an entire array, and the weakness being that (depending on how many pointers there are to the list: head, tail, middle, quarters, etc.) accessing a random value v[99] can take the overhead of iterating through several elements before the desired one is found.

Is this correct?

Upvotes: 1

Theodore Lytras
Theodore Lytras

Reputation: 3963

As already mentioned, check out the "R internals" manual, as well as this part of "Writing R extensions".

Upvotes: 4

Related Questions