krlmlr

Reputation: 25484

Save storage space for small integers or factors with few levels

R seems to require four bytes of storage per integer, even for small ones:

> object.size(rep(1L, 10000))
40040 bytes

And, what is more, even for factors:

> object.size(factor(rep(1L, 10000)))
40456 bytes

I think that, especially in the latter case, this could be handled much better. Is there a solution that would help me reduce the storage requirements for this case to eight or even two bits per row? Perhaps a solution that uses the raw type internally for storage but behaves like a normal factor otherwise. The bit package offers this for bits, but I haven't found anything similar for factors.

My data frame with just a few million rows is consuming gigabytes, which is a huge waste of memory and run time. Compression would reduce the required disk space, but again at the expense of run time.

Upvotes: 5

Views: 1047

Answers (3)

Jan van der Laan

Reputation: 8105

One other solution is using ff. ff supports the following vmodes/types (see ?vmode):

 ‘boolean’    ‘as.boolean’    1 bit logical without NA           
 ‘logical’    ‘as.logical’    2 bit logical with NA              
 ‘quad’       ‘as.quad’       2 bit unsigned integer without NA  
 ‘nibble’     ‘as.nibble’     4 bit unsigned integer without NA  
 ‘byte’       ‘as.byte’       8 bit signed integer with NA       
 ‘ubyte’      ‘as.ubyte’      8 bit unsigned integer without NA  
 ‘short’      ‘as.short’      16 bit signed integer with NA      
 ‘ushort’     ‘as.ushort’     16 bit unsigned integer without NA 
 ‘integer’    ‘as.integer’    32 bit signed integer with NA      
 ‘single’     ‘as.single’     32 bit float                       
 ‘double’     ‘as.double’     64 bit float                       
 ‘complex’    ‘as.complex’    2x64 bit float                     
 ‘raw’        ‘as.raw’        8 bit unsigned char                
 ‘character’  ‘as.character’  character

For example:

library(ff)
v <- ff(as.factor(sample(letters[1:4], 10000, replace=TRUE)), vmode="byte", 
    levels=letters[1:4])

This will use only one byte per element. An added advantage/disadvantage is that when the data becomes too large to store in memory, it is automatically stored on disk (which, of course, will affect performance).
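As a quick sketch of working with such an object (this assumes ff's documented behaviour that subsetting an ff factor materialises an ordinary in-memory factor):

```r
library(ff)

# one byte per element on disk/in the ff buffer
v <- ff(as.factor(sample(letters[1:4], 10000, replace = TRUE)),
        vmode = "byte", levels = letters[1:4])

vmode(v)   # "byte" -- the compact storage mode
v[1:5]     # a small window, read back as an ordinary R factor
```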

However, whatever solution you use, you will probably run into reduced performance. R internally uses integers for factors, so before calling any R method the data will have to be translated from the compact storage to R's integers, which will cost. Unless you only use methods written specifically for the compact storage type (these will probably have to be written in C/C++/...).

Upvotes: 3

eddi

Reputation: 49448

Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the prerequisite conversion operations if memory is your bottleneck and CPU time isn't. For example:

f = factor(rep(1L, 1e5))
object.size(f)
# 400456 bytes

f.raw = as.raw(f)
object.size(f.raw)
#100040 bytes

# to go back:
identical(as.factor(as.integer(f.raw)), f)
#[1] TRUE

You can also save the factor levels separately and recover them if that's something you're interested in doing, but as far as grouping and the like go, you can do it all with raw and never go back to factors (except for presentation).

If you have specific use cases where this method causes trouble, please post them; otherwise I think this should work just fine.


Here's a starting point for your byte.factor class:

byte.factor = function(f) {
  res = as.raw(f)
  attr(res, "levels") <- levels(f)
  attr(res, "class") <- "byte.factor"
  res
}

as.factor.byte.factor = function(b) {
  factor(attributes(b)$levels[as.integer(b)], attributes(b)$levels)
}

So you can do things like:

f = factor(c('a','b'), letters)
f
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

b = byte.factor(f)
b
#[1] 01 02
#attr(,"levels")
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#[20] "t" "u" "v" "w" "x" "y" "z"
#attr(,"class")
#[1] "byte.factor"

as.factor.byte.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

Check out how data.table overrides rbind.data.frame if you want to make as.factor generic and just add whatever functions you want to add. Should all be quite straightforward.
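A minimal sketch of that last step (note: this shadows `base::as.factor` with a generic in your workspace, which is one assumption about how you want dispatch to work):

```r
# Turn as.factor into an S3 generic so that as.factor(b) dispatches
# to as.factor.byte.factor defined above; anything without a specific
# method falls through to the base implementation.
as.factor <- function(x, ...) UseMethod("as.factor")
as.factor.default <- function(x, ...) base::as.factor(x)
```

After this, `as.factor(b)` works directly instead of having to call `as.factor.byte.factor(b)` by its full name.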

Upvotes: 6

Martin Morgan

Reputation: 46876

A little outside the box, but run-length encodings might be appropriate for long factors with few levels, provided the elements are ordered to some extent. This is supported by the IRanges package in Bioconductor:

rle = Rle(factor("A"), 1000000)
df = DataFrame(rle=rle)

and

> object.size(rle)
1528 bytes

DataFrame and Rle support all the standard operations, e.g., subsetting and addition of Rles. Of course, the size savings depend crucially on maintaining sorted order.
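For instance, a small sketch of those operations (`runLength` and `runValue` are the standard Rle run accessors):

```r
library(IRanges)

# eight elements stored as two runs: five "A"s followed by three "B"s
x <- Rle(factor(rep(c("A", "B"), c(5, 3))))

runLength(x)   # lengths of the runs: 5 3
runValue(x)    # values of the runs: A B
x[2:6]         # subsetting returns another (still compact) Rle
```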

Upvotes: 2
