Maiasaura

Reputation: 32986

How do I subsample data by group using ddply?

I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset.

I wrote a function to subset a data frame as follows:

    samp <- function(dataf)
    {
        dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),]
    }

Now I want to apply this function to each species in a larger data frame.

When I try something like

    culled_data = ddply (larger_data, .(species), subset, samp)

I get this error:

    Error in subset.data.frame(piece, ...) : 
      'subset' must evaluate to logical

Anyone got ideas on how to do this?

Upvotes: 9

Views: 6416

Answers (2)

Marek

Reputation: 50704

Dirk's answer is of course correct, but I am posting my own to add some explanation.

Why doesn't your call work?

First of all, your syntax is shorthand. It is equivalent to

    ddply(larger_data, .(species), function(dfrm) subset(dfrm, samp))

so you can clearly see that you are passing a function (see class(samp)) as the second argument of subset. You could use samp(dfrm) instead, but that won't work either, because samp returns a data.frame and subset needs a logical vector. So you could only use samp(dfrm) if it returned a logical index.
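To make the point above concrete, here is a minimal sketch in base R, using a hypothetical toy data frame df (not from the question). It shows that both variants hit the same check inside subset, because neither a function nor a data.frame is a logical vector:

```r
# samp as defined in the question
samp <- function(dataf) {
  dataf[sample(1:dim(dataf)[1], size = 40, replace = FALSE), ]
}

# Hypothetical toy data, only for illustration
df <- data.frame(species = rep("a", 50), x = seq_len(50))

class(samp)  # "function"

# Capture the error messages instead of stopping the script
msg_fun <- tryCatch(subset(df, samp),     error = function(e) conditionMessage(e))
msg_df  <- tryCatch(subset(df, samp(df)), error = function(e) conditionMessage(e))

msg_fun  # "'subset' must evaluate to logical"
msg_df   # "'subset' must evaluate to logical"
```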

How do you use subset in this case?

Make subset work by feeding it a logical vector:

    ddply (larger_data, .(species), subset, sample(seq_along(species)<=40))

I create a logical vector with 40 TRUE values and shuffle it. (By the way, it also works when some species have fewer than 40 cases: it then returns all of them.)
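The shuffled logical vector can be inspected on its own, outside ddply. This is a small base R sketch with made-up group sizes (100 and 25 are assumptions for illustration), using seq_len in place of seq_along(species):

```r
set.seed(1)  # only to make the shuffle reproducible

# Group with more than 40 rows: exactly 40 TRUE values, in random positions
big <- sample(seq_len(100) <= 40)

# Group with fewer than 40 rows: every element is TRUE, so all rows are kept
small <- sample(seq_len(25) <= 40)

sum(big)     # 40
length(big)  # 100
sum(small)   # 25
```

Because the vector always has the same length as the group, subset never indexes out of range, which is why small groups are returned whole rather than raising an error.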

Upvotes: 6

Dirk is no longer here

Reputation: 368241

It looks like it should work once you remove ", subset" from your call, so that samp is passed to ddply directly.
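A minimal sketch of that corrected call, assuming the plyr package is installed and using a hypothetical toy data frame in place of the asker's larger_data (the columns x and y are invented for illustration):

```r
library(plyr)  # provides ddply() and .()

# Hypothetical stand-in for larger_data: two species, each with 40+ rows
set.seed(42)
larger_data <- data.frame(
  species = rep(c("a", "b"), times = c(100, 60)),
  x = runif(160),
  y = runif(160)
)

# samp as defined in the question: 40 random rows, without replacement
samp <- function(dataf) {
  dataf[sample(1:dim(dataf)[1], size = 40, replace = FALSE), ]
}

# samp is applied to each species' piece and the results are row-bound
culled_data <- ddply(larger_data, .(species), samp)

table(culled_data$species)  # 40 rows for "a", 40 rows for "b"
```

Note that samp as written asks for exactly 40 rows, so sample() will raise an error for any species with fewer than 40 rows; the toy data above avoids that case on purpose.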

Upvotes: 6
