I'm working with a large dataset (3.5M rows and 40 columns) and I need to clean out some values so that I'll be able to calculate other parameters that are necessary when I start formulating a model around the data.
The problem is that the for loops I have been using take forever to run, so I wanted to try the ff package. The data frame is called data and consists of a bunch of customer information for a bank. It was imported from a .csv file. What I need to do is remove all customers (identified by Serial) whose AverageStanding variable is ever negative:
> ffd<-as.ffdf(data)
> lastserial = tail(ffd$Serial,1)
> for(k in 1:lastserial){
+ tempvecWith <- vector()
+ tempvecWith <- ffd[ffd$Serial==k, ]$AverageStanding
+ if(any(tempvecWith < 0)){
+ ffd_clean<- ffd[!ffd$Serial ==k, ]
+ }
+ }
This is the error that I am receiving:
Error in as.hi.integer(x, maxindex = maxindex, dim = dim, vw = vw, pack = pack) :
NAs in as.hi.integer
Any ideas on how I can avoid these errors?
Upvotes: 1
Views: 622
The error comes from this part of your code: ffd[ffd$Serial==k, ]. Namely, ffd$Serial==k returns an ff logical vector. But if you want to index or subset an ff vector or ffdf, you need to supply the index numbers, not a vector of logicals. You can turn your ff vector of logicals into an ff vector of index numbers by using ffwhich from package ffbase.
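As a rough analogy only, here is the same logical-to-position conversion on ordinary in-memory vectors in base R (not ff objects). The vectors flags and x are made up for illustration. ffwhich plays the corresponding role for ff vectors.

```r
# Base-R analogy: turn a logical vector into integer positions.
flags <- c(FALSE, TRUE, NA, TRUE)

# which() returns the positions of the TRUE entries and skips NA
positions <- which(flags)

# The positions can then be used as an ordinary integer index
x <- c(10, 20, 30, 40)
selected <- x[positions]
print(positions)
print(selected)
```

Note that which() silently drops NA entries, so a missing value in the comparison never ends up in the index.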
So for your question, I believe you are looking for something like the following code (not tested, as you did not supply any data).
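library(ffbase)
# Positions of the rows where AverageStanding is negative
idx <- ffd$AverageStanding < 0
idx <- ffwhich(idx, idx == TRUE)
open(ffd)
# Serial numbers of the customers that own those rows
serials.with.negative <- ffd$Serial[idx]
serials.with.negative <- unique(serials.with.negative)
# Flag every row that belongs to one of those customers
ffd$is.customer.with.negative.avgstanding <- ffd$Serial %in% serials.with.negative
# Keep only the rows of customers that were never negative
idx <- ffd$is.customer.with.negative.avgstanding == FALSE
idx <- ffwhich(idx, idx == TRUE)
open(ffd)
ffd_clean <- ffd[idx, ]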
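For comparison, here is a minimal sketch of the same logic in base R on an ordinary data frame, with no loop over customers. It assumes the table fits in memory, and the toy values below are invented purely for illustration. Only the column names Serial and AverageStanding come from the question.

```r
# Toy data standing in for the real table (values are made up)
data <- data.frame(
  Serial          = c(1, 1, 2, 2, 3, 3),
  AverageStanding = c(100, -5, 20, 30, 0, 15)
)

# Serials with at least one negative AverageStanding.
# which() drops NA comparisons, so missing values never select a row.
bad_serials <- unique(data$Serial[which(data$AverageStanding < 0)])

# Keep only the rows whose Serial is not in that set
data_clean <- data[!(data$Serial %in% bad_serials), ]
print(data_clean)
```

Here customer 1 is removed entirely because one of its rows is negative, while customers 2 and 3 are kept.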
Upvotes: 1