The basic idea is this: I have a large ffdf (about 5.5 million rows x 136 fields). I know for a fact that some of the columns in this data frame are entirely NA. How do I find out which ones and remove them appropriately?
My instinct is to do something like (assuming df is the ffdf):
apply(X=is.na(df[,1:136]), MARGIN = 2, FUN = sum)
which should give me a vector of the NA counts for each column, and then I could find which ones have ~5.5 million NA values, remove them using df <- df[,-c(vector of columns)], etc. Pretty straightforward.
However, apply gives me an error:
Error: cannot allocate vector of size 21.6 Mb
In addition: Warning messages:
1: In `[.ff`(p, i2) :
Reached total allocation of 3889Mb: see help(memory.size)
2: In `[.ff`(p, i2) :
Reached total allocation of 3889Mb: see help(memory.size)
3: In `[.ff`(p, i2) :
Reached total allocation of 3889Mb: see help(memory.size)
4: In `[.ff`(p, i2) :
Reached total allocation of 3889Mb: see help(memory.size)
This tells me that apply can't handle a data frame of this size. Are there any alternatives I can use?
Upvotes: 3
It is easier to use all(is.na(column)). sapply/lapply do not work because an ffdf object is not a list.
You use df[, 1:136] in your code. This causes ff to try to load all 136 columns into memory, which is what triggers the memory error. This does not happen when you index without the comma, as in df[1:136]. The same applies when indexing for the final result: df <- df[,-c(vector of columns)] reads all selected columns into memory.
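To get a feel for the numbers, here is a rough back-of-the-envelope estimate of the logical matrix that is.na(df[, 1:136]) would have to build. It uses only the dimensions from the question and the fact that R stores each logical value in 4 bytes. The variable names are made up for illustration, and the estimate ignores the memory needed for the data itself, since the column types are not given.

```r
# Dimensions taken from the question
n_rows <- 5.5e6
n_cols <- 136

# R stores each element of a logical vector or matrix in 4 bytes
bytes_per_logical <- 4

# Approximate size of the full is.na() result, in GiB
na_matrix_gib <- n_rows * n_cols * bytes_per_logical / 1024^3
print(round(na_matrix_gib, 2))
```

That is already a large share of the 3889Mb limit reported in the warnings, before counting the loaded columns.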
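# One flag per column: TRUE if the column is entirely NA
na_cols <- logical(136)
for (i in seq_len(136)) {
  # df[[i]] touches only one column at a time
  na_cols[i] <- all(is.na(df[[i]]))
}
# Index without a comma so the columns are not read into memory
res <- df[!na_cols]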
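If you want to try the pattern without the ff package, here is a minimal sketch on a plain data.frame. The toy data and its column names are hypothetical stand-ins, not the original data, and a plain data.frame does not reproduce ff's on-disk behaviour. It only shows the column-at-a-time logic.

```r
# Hypothetical toy data: columns b and d are entirely NA
toy <- data.frame(
  a = c(1, NA, 3),
  b = NA_real_,
  c = c("x", "y", NA),
  d = NA
)

# Check one column per iteration instead of calling is.na() on the whole frame
na_cols <- logical(ncol(toy))
for (i in seq_len(ncol(toy))) {
  na_cols[i] <- all(is.na(toy[[i]]))
}

# Keep only the columns that contain at least one non-NA value
res <- toy[!na_cols]
print(names(res))
```

Using ncol() instead of a hard-coded 136 keeps the loop valid if the number of columns changes.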
Upvotes: 1
Try this example:
#dummy data
df <- sample(1000000*5)
df <- data.frame( matrix(df,nrow = 1000000))
df$X3 <- NA
df$X6 <- NA
#list of col to remove or keep
colToRemove <- colnames(df)[ colSums(is.na(df[ ,1:6])) == nrow(df) ]
colToKeep <- setdiff(colnames(df), colToRemove)
#subset
res <- df[, colToKeep]
colnames(df)
#[1] "X1" "X2" "X3" "X4" "X5" "X6"
colnames(res)
#[1] "X1" "X2" "X4" "X5"
Upvotes: 0