Jeremy Leipzig

Reputation: 1944

first row for non-aggregate functions

I use ddply to avoid redundant calculations.

I am often dealing with values that are constant within each split subset, and doing non-aggregate analysis on them. So to avoid this (a toy example):

ddply(baseball,.(id,year),function(x){paste(x$id,x$year,sep="_")})

Error in list_to_dataframe(res, attr(.data, "split_labels")) : 
  Results do not have equal lengths

I have to take the first row of each mini data frame.

ddply(baseball,.(id,year),function(x){paste(x$id[1],x$year[1],sep="_")})

Is there a different approach or a helper I should be using? This syntax seems awkward.

--

Note: paste in my example is just for show - don't take it too literally. Imagine this is the actual function:

ddply(baseball,.(id,year),function(x){the_slowest_function_ever(x$id[1],x$year[1])})

Upvotes: 1

Views: 418

Answers (2)

Matt Dowle

Reputation: 59602

You might find data.table a little easier and faster in this case. The equivalent of .() variables is by= :

DT[, { paste(id,year,sep="_") }, by=list(id,year) ]

or

DT[, { do.call("paste",.BY) }, by=list(id,year) ]

I've shown the {} to illustrate you can put any (multi-line) anonymous body in j (rather than a function), but in these simple examples you don't need the {}.

The grouping variables are length 1 inside the scope of each group (which seems to be what you're asking for), for speed and convenience. .BY also contains the grouping variables in one list object, for generic access when the by criteria are decided programmatically on the fly; i.e., when you don't know the by variables in advance.
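To make the length-1 behaviour concrete, here is a minimal sketch. It assumes a small toy table in place of baseball, and slow_label() is a hypothetical stand-in for the expensive function from the question, not something from either package.

```r
library(data.table)

# Toy stand-in for the baseball data (made-up values, for illustration only)
DT <- data.table(
  id   = c("a", "a", "a", "b", "b"),
  year = c(2000, 2000, 2001, 2000, 2000),
  hits = c(1, 2, 3, 4, 5)
)

# Hypothetical placeholder for an expensive function of the group keys
slow_label <- function(id, year) paste(id, year, sep = "_")

# Inside j, id and year are length 1 for each group,
# so slow_label() is called once per (id, year) combination
res <- DT[, .(label = slow_label(id, year)), by = .(id, year)]

# Same result via .BY, without naming the grouping columns in j;
# .BY is a named list, so its names match slow_label()'s arguments
res_by <- DT[, .(label = do.call(slow_label, .BY)), by = .(id, year)]

print(res)
```

The result has one row per group rather than one row per input row, which is what you want if the function only depends on the grouping values.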

Upvotes: 3

Brian Diggs

Reputation: 58825

You could use:

ddply(baseball, .(id, year), function(x){data.frame(paste(x$id,x$year,sep="_"))})

When your function returns a vector, ddply puts the results back together as a data.frame by making each entry of the vector a column. But the groups have different numbers of rows, so the returned vectors have different lengths and would not all give the same number of columns; that is what the error is complaining about. By wrapping the result in data.frame(), you make sure that your function returns a data.frame that has the column you want rather than relying on the implicit (and in this case, wrong) transformation. Also, you can name the new column easily within this construct.
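A small sketch of the difference, assuming a toy data frame with groups of unequal size in place of baseball (the column name label is just an illustrative choice):

```r
library(plyr)

# Toy stand-in for baseball: group (a, 2000) has 2 rows, group (b, 2001) has 1
df <- data.frame(id = c("a", "a", "b"), year = c(2000, 2000, 2001))

# Returning a bare vector: the groups yield vectors of length 2 and 1.
# In the question this call failed with "Results do not have equal lengths";
# tryCatch keeps the script running either way.
res_vec <- tryCatch(
  ddply(df, .(id, year), function(x) paste(x$id, x$year, sep = "_")),
  error = function(e) conditionMessage(e)
)
print(res_vec)

# Returning a data.frame: one named column, one row per input row
res_ok <- ddply(df, .(id, year), function(x) {
  data.frame(label = paste(x$id, x$year, sep = "_"))
})
print(res_ok)
```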

UPDATE:

Given you only want to evaluate the function once (which is reasonable), then you can just pull the first row out by itself and operate on that.

ddply(baseball, .(id, year), function(x) {
  x <- x[1,]
  paste(x$id, x$year, sep="_")
})

This will (by itself) have only a single row for each id/year combo. If you want it to have the same number of rows as the original, then you can combine this with the previous idea.

ddply(baseball, .(id, year), function(x) {
  firstrow <- x[1,]
  data.frame(label=rep(paste(firstrow$id, firstrow$year, sep="_"), nrow(x)))
})

Upvotes: 1
