Reputation: 25484
R seems to require four bytes of storage per integer, even for small ones:
> object.size(rep(1L, 10000))
40040 bytes
And, what is more, the same holds even for factors:
> object.size(factor(rep(1L, 10000)))
40456 bytes
I think, especially in the latter case, this could be handled much better. Is there a solution that would help me reduce the storage requirements for this case to eight or even two bits per row? Perhaps a solution that uses the raw type internally for storage but behaves like a normal factor otherwise. The bit package offers this for bits, but I haven't found anything similar for factors.
My data frame with just a few million rows is consuming gigabytes, and that's a huge waste of memory and run time. Compression will reduce the required disk space, but again at the expense of run time.
Upvotes: 5
Views: 1047
Reputation: 8105
One other solution is using ff. ff supports the following vmodes/types (see ?vmode):
‘boolean’ ‘as.boolean’ 1 bit logical without NA
‘logical’ ‘as.logical’ 2 bit logical with NA
‘quad’ ‘as.quad’ 2 bit unsigned integer without NA
‘nibble’ ‘as.nibble’ 4 bit unsigned integer without NA
‘byte’ ‘as.byte’ 8 bit signed integer with NA
‘ubyte’ ‘as.ubyte’ 8 bit unsigned integer without NA
‘short’ ‘as.short’ 16 bit signed integer with NA
‘ushort’ ‘as.ushort’ 16 bit unsigned integer without NA
‘integer’ ‘as.integer’ 32 bit signed integer with NA
‘single’ ‘as.single’ 32 bit float
‘double’ ‘as.double’ 64 bit float
‘complex’ ‘as.complex’ 2x64 bit float
‘raw’ ‘as.raw’ 8 bit unsigned char
‘character’ ‘as.character’ character
For example:
library(ff)
v <- ff(as.factor(sample(letters[1:4], 10000, replace=TRUE)), vmode="byte",
levels=letters[1:4])
This will use only one byte per element. An added advantage/disadvantage is that when the data becomes too large to store in memory, it is automatically stored on disk (which of course will affect performance).
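Since the question asks for as little as two bits per element, the smaller vmodes get close to that. Here is a sketch, assuming the same call works for the other integer vmodes; note that the 1-based factor codes have to fit into the vmode's range, so nibble (0-15) should hold up to 15 levels and quad (0-3) presumably only up to 3:
# 4-level factor: the codes 1..4 fit comfortably into a nibble (4 bits)
v4 <- ff(as.factor(sample(letters[1:4], 10000, replace=TRUE)), vmode="nibble",
        levels=letters[1:4])
vmode(v4)  # "nibble", i.e. half a byte per element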
However, whatever solution you use, you will probably run into reduced performance. R internally uses integers for factors, so before calling any R method the data will have to be translated from the compact storage to R's integers, and that translation will cost time. The exception is if you only use methods written specifically for the compact storage type (and these will probably have to be written in C/C++/...).
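As a small illustration of that round trip, here is a sketch reusing v from the example above; extracting a subset should materialize an ordinary R factor again:
x <- v[1:5]  # reads the byte codes and expands them back to R integers
class(x)     # "factor"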
Upvotes: 3
Reputation: 49448
Since you mention raw (and assuming you have fewer than 256 factor levels), you could do the required conversion operations yourself if memory is your bottleneck and CPU time isn't. For example:
f = factor(rep(1L, 1e5))
object.size(f)
# 400456 bytes
f.raw = as.raw(f)
object.size(f.raw)
#100040 bytes
# to go back:
identical(as.factor(as.integer(f.raw)), f)
#[1] TRUE
You can also save the factor levels separately and recover them if that's something you're interested in doing, but as far as grouping and all that goes, you can just do it all with raw and never go back to factors (except for presentation); a small sketch of that follows.
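For example, per-level counts can be computed straight from the raw codes with tabulate; a minimal sketch (note that as.integer allocates a temporary integer vector, trading a transient allocation for never storing the full factor):
counts = tabulate(as.integer(f.raw), nbins = length(levels(f)))
names(counts) = levels(f)
counts
#     1
#100000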
If you have specific use cases where you have trouble with this method, please post them; otherwise I think this should work just fine.
Here's a starting point for your byte.factor class:
byte.factor = function(f) {
  # store the 1-based level codes as raw -- one byte per element
  res = as.raw(f)
  attr(res, "levels") <- levels(f)
  attr(res, "class") <- "byte.factor"
  res
}
as.factor.byte.factor = function(b) {
  # map the raw codes back onto the stored levels
  factor(attributes(b)$levels[as.integer(b)], attributes(b)$levels)
}
So you can do things like:
f = factor(c('a','b'), letters)
f
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
b = byte.factor(f)
b
#[1] 01 02
#attr(,"levels")
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#[20] "t" "u" "v" "w" "x" "y" "z"
#attr(,"class")
#[1] "byte.factor"
as.factor.byte.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
Check out how data.table overrides rbind.data.frame if you want to make as.factor generic and just add whatever methods you like; a minimal sketch of that follows. It should all be quite straightforward.
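A minimal sketch of that (note this shadows base's as.factor with an S3 generic; that is one way to wire up dispatch, not something data.table itself provides):
as.factor = function(x, ...) UseMethod("as.factor")
as.factor.default = function(x, ...) base::as.factor(x)  # fall back to base R
# the byte.factor method defined above now dispatches automatically:
as.factor(b)
#[1] a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z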
Upvotes: 6
Reputation: 46876
A little outside the box, but run-length encodings might be appropriate for long factors with few levels, provided the elements are ordered to some extent; this is supported by the IRanges package in Bioconductor:
library(IRanges)  # provides the Rle and DataFrame classes
rle = Rle(factor("A"), 1000000)
df = DataFrame(rle=rle)
and
> object.size(rle)
1528 bytes
DataFrame and Rle support all the standard operations, e.g., subsetting and addition of Rles. Of course, the size savings depend crucially on maintaining sorted order; the sketch below makes that concrete.
Upvotes: 2