Vince Gatto
Vince Gatto

Reputation: 415

Flatten/denormalize the result of the R aggregate function

I'm fairly new to R and I'm trying to use aggregate to perform some time series shaping on a dataframe, per subject and for each metric in my dataset. This works beautifully, but I find that the result isn't in a format that's very easy to use. I'd like to be able to transform the results back into the same format as the original dataframe.

Using the iris dataset as an example:

# Split into two data frames, one for metrics, the other for grouping
iris_species = subset(iris, select=Species)
iris_metrics = subset(iris, select=-Species)
# Compute diff for each metric with respect to its species
iris_diff = aggregate(iris_metrics, iris_species, diff)

I'm just using diff to illustrate that I have a function that shapes the time series, so I get a time series of possibly different length as a result and definitely not a single aggregate value (e.g. mean).

I'd like to transform the result, which seems to be a matrix that has list valued cells to the original "flat" dataframe.

I'm mostly curious about how to manage this with results from aggregate, but I'd be ok with solutions that do everything in plyr or reshape.

Upvotes: 5

Views: 1666

Answers (4)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Here's what I understand as your problem: With your current method of using aggregate, you get a matrix for the results of "Sepal.Length", "Sepal.Width", and so on.

> str(iris_diff)
'data.frame':   3 obs. of  5 variables:
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
 $ Sepal.Length: num [1:3, 1:49] -0.2 -0.6 -0.5 -0.2 0.5 ...
 $ Sepal.Width : num [1:3, 1:49] -0.5 0 -0.6 0.2 -0.1 0.3 -0.1 -0.8 -0.1 0.5 ...
 $ Petal.Length: num [1:3, 1:49] 0 -0.2 -0.9 -0.1 0.4 ...
 $ Petal.Width : num [1:3, 1:49] 0 0.1 -0.6 0 0 0.2 0 -0.2 -0.3 0 ...

But, in your console, that displays as what looks like a data.frame with 197 columns.

You want to convert "iris_diff" to a data.frame with 197 columns. Here's how you can do that with your existing output (a trick I picked up from @James, here on SO):

do.call(data.frame, iris_diff)

Here are the first few lines of the output when we view the str of that action:

> str(do.call(data.frame, iris_diff))
'data.frame':   3 obs. of  197 variables:
 $ Species        : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
 $ Sepal.Length.1 : num  -0.2 -0.6 -0.5
 $ Sepal.Length.2 : num  -0.2 0.5 1.3
 $ Sepal.Length.3 : num  -0.1 -1.4 -0.8
 $ Sepal.Length.4 : num  0.4 1 0.2
 $ Sepal.Length.5 : num  0.4 -0.8 1.1
 $ Sepal.Length.6 : num  -0.8 0.6 -2.7
 $ Sepal.Length.7 : num  0.4 -1.4 2.4
 $ Sepal.Length.8 : num  -0.6 1.7 -0.6
 $ Sepal.Length.9 : num  0.5 -1.4 0.5
 $ Sepal.Length.10: num  0.5 -0.2 -0.7

Upvotes: 1

IRTFM
IRTFM

Reputation: 263332

If you wanted to return some sort of first differences vector with the same length as the input vector, you should be doing so with ave and an anonymous function. Since diff returns a vector of a different length you need to extend it with NA (or a marker of your choosing).

iris_diff = lapply(iris_metrics, 
        function(xx) ave(xx, iris_species, FUN=function(x) c(NA, diff(x) ) )  )
str(iris_diff)
#--------------
List of 4
 $ Sepal.Length: num [1:150] NA -0.2 -0.2 -0.1 0.4 ...
 $ Sepal.Width : num [1:150] NA -0.5 0.2 -0.1 0.5 0.3 -0.5 0 -0.5 0.2 ...
 $ Petal.Length: num [1:150] NA 0 -0.1 0.2 -0.1 ...
 $ Petal.Width : num [1:150] NA 0 0 0 0 0.2 -0.1 -0.1 0 -0.1 ...

If you wanted that as a dataframe just wrap data.frame around it. And it would be a good idea to include the original grouping vector:

iris_diff <- data.frame( Species= iris_species, iris_diff)
str(iris_diff)
#------
'data.frame':   150 obs. of  5 variables:
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.Length: num  NA -0.2 -0.2 -0.1 0.4 ...
 $ Sepal.Width : num  NA -0.5 0.2 -0.1 0.5 0.3 -0.5 0 -0.5 0.2 ...
 $ Petal.Length: num  NA 0 -0.1 0.2 -0.1 ...
 $ Petal.Width : num  NA 0 0 0 0 0.2 -0.1 -0.1 0 -0.1 ...

Upvotes: 1

Arun
Arun

Reputation: 118789

The best solution I could think of in this case is data.table:

require(data.table)
dt <- data.table(iris, key="Species")
dt.out <- dt[, lapply(.SD, diff), by=Species]

And if you want a plyr solution, then the idea is basically the same. Split by Species and apply diff to each column.

require(plyr)
ddply(iris, .(Species), function(x) do.call(cbind, lapply(x[,1:4], diff)))

Upvotes: 2

Matthew Lundberg
Matthew Lundberg

Reputation: 42639

As you might know, aggregate works on one column at a time. A single value is expected, and odd things happen if you return vectors of length different from 1.

You can split this up with by to get the data (with fewer rows than in iris) and put it back together:

b <- by(iris_metrics, iris_species, FUN=function(x) diff(as.matrix(x)))
do.call(rbind, lapply(names(b), function(x) data.frame(Species=x, b[[x]])))

diff(as.matrix) is used as this does what you want for matrices (but not for data frames). The key point is that the function returns a different number of rows than are in each Species in iris.

Upvotes: 2

Related Questions