Gotey

Reputation: 629

Applying a function to a large data set

I am reading a large dataset into R and want to apply the unique() function to it so that it is easier to work with, but when I try, I get this error:

clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb

So I am trying to apply the function chunk by chunk, like this:

clientsmd<-data.frame()
n<-7316738  #Amount of observations in the dataset
t<-0
for(i in 1:200){
  clientsm<-clients[1+(t*round((n/200))):(t+1)*round((n/200)),]
  clientsm<-unique(clientsm)
  clientsmd<-rbind(clientsm)
  t<-(t+1) }

But I get this:

 Error in `[.default`(xj, i) : subscript too large for 32-bit R

I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or any other), but I don't know how to use them for this purpose.

I'd appreciate any guidance, whether it explains why my code doesn't work or shows how I could take advantage of these packages.

Upvotes: 0

Views: 867

Answers (3)

Sowmya S. Manian

Reputation: 3833

Increase your memory limit as shown below, then try running it again.

 memory.limit(4000)   ## windows specific command

Upvotes: 1

iboboboru

Reputation: 1102

Is clients a data.frame or a data.table? data.table can handle much larger amounts of data than data.frame.

library(data.table)

clients <- data.table(clients)

clientsUnique <- unique(clients)

or

duplicateIndex <- duplicated(clients)

which returns a logical vector flagging the rows that duplicate an earlier row.
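
To make the two options concrete, here is a minimal sketch on a tiny made-up table. The toy `clients` data and its column names are assumptions for illustration only, not the asker's real data. It also swaps `data.table(clients)` for `setDT()`, which converts in place rather than copying, and that may help when memory is already tight.

```r
library(data.table)

# Hypothetical stand-in for the `clients` data set from the question
clients <- data.frame(
  id   = c(1, 2, 2, 3, 3, 3),
  city = c("A", "B", "B", "C", "C", "D")
)

# setDT() converts the data.frame to a data.table by reference (no copy)
setDT(clients)

# duplicated() returns a logical vector: TRUE for rows that repeat an earlier row
dup_flags <- duplicated(clients)

# Keep only the first occurrence of each row; same rows as unique(clients)
clients_unique <- clients[!dup_flags]
```

Filtering with `!dup_flags` and calling `unique(clients)` should give the same rows, so which one to use is mostly a matter of whether the duplicate flags themselves are needed later.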

Upvotes: 1

Pankaj Kaundal

Reputation: 1022

You could use the distinct() function from the dplyr package:

df %>% distinct(ID)

where ID is a column that uniquely identifies each record in your data frame.
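
One thing to watch for: `distinct(ID)` on its own returns only the `ID` column. The sketch below, on a small made-up data frame (the `df` contents and the `value` column are assumptions for illustration), shows the three variants side by side so the difference is visible.

```r
library(dplyr)

# Hypothetical stand-in for the data frame in the question
df <- data.frame(
  ID    = c(1, 2, 2, 3),
  value = c("a", "b", "b", "c")
)

# Drops exact duplicate rows, comparing all columns
all_cols <- df %>% distinct()

# Returns only the ID column, one row per distinct ID
ids_only <- df %>% distinct(ID)

# One row per distinct ID, keeping the other columns from the first match
first_per_id <- df %>% distinct(ID, .keep_all = TRUE)
```

If the goal is the same as `unique(clients)`, that is, dropping rows that are identical in every column, the no-argument `distinct()` form is probably the closest match.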

Upvotes: 0
