Reputation: 415
I'm fairly new to R and I'm trying to use aggregate
to perform some time series shaping on a dataframe, per subject and for each metric in my dataset. This works beautifully, but I find that the result isn't in a format that's very easy to use. I'd like to be able to transform the results back into the same format as the original dataframe.
Using the iris dataset as an example:
# Split into two data frames, one for metrics, the other for grouping
iris_species = subset(iris, select=Species)
iris_metrics = subset(iris, select=-Species)
# Compute diff for each metric with respect to its species
iris_diff = aggregate(iris_metrics, iris_species, diff)
I'm just using diff
to illustrate that I have a function that shapes the time series, so I get a time series of possibly different length as a result and definitely not a single aggregate value (e.g. mean).
I'd like to transform the result, which seems to be a matrix that has list valued cells to the original "flat" dataframe.
I'm mostly curious about how to manage this with results from aggregate
, but I'd be ok with solutions that do everything in plyr
or reshape
.
Upvotes: 5
Views: 1666
Reputation: 193517
Here's what I understand as your problem: With your current method of using aggregate
, you get a matrix
for the results of "Sepal.Length", "Sepal.Width", and so on.
> str(iris_diff)
'data.frame': 3 obs. of 5 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ Sepal.Length: num [1:3, 1:49] -0.2 -0.6 -0.5 -0.2 0.5 ...
$ Sepal.Width : num [1:3, 1:49] -0.5 0 -0.6 0.2 -0.1 0.3 -0.1 -0.8 -0.1 0.5 ...
$ Petal.Length: num [1:3, 1:49] 0 -0.2 -0.9 -0.1 0.4 ...
$ Petal.Width : num [1:3, 1:49] 0 0.1 -0.6 0 0 0.2 0 -0.2 -0.3 0 ...
But, in your console, that displays as what looks like a data.frame
with 197 columns.
You want to convert "iris_diff" to a data.frame
with 197 columns. Here's how you can do that with your existing output (a trick I picked up from @James, here on SO):
do.call(data.frame, iris_diff)
Here are the first few lines of the output when we view the str
of that action:
> str(do.call(data.frame, iris_diff))
'data.frame': 3 obs. of 197 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ Sepal.Length.1 : num -0.2 -0.6 -0.5
$ Sepal.Length.2 : num -0.2 0.5 1.3
$ Sepal.Length.3 : num -0.1 -1.4 -0.8
$ Sepal.Length.4 : num 0.4 1 0.2
$ Sepal.Length.5 : num 0.4 -0.8 1.1
$ Sepal.Length.6 : num -0.8 0.6 -2.7
$ Sepal.Length.7 : num 0.4 -1.4 2.4
$ Sepal.Length.8 : num -0.6 1.7 -0.6
$ Sepal.Length.9 : num 0.5 -1.4 0.5
$ Sepal.Length.10: num 0.5 -0.2 -0.7
Upvotes: 1
Reputation: 263332
If you wanted to return some sort of first differences vector with the same length as the input vector, you should be doing so with ave and an anonymous function. Since diff
returns a vector of a different length you need to extend it with NA (or a marker of your choosing).
iris_diff = lapply(iris_metrics,
function(xx) ave(xx, iris_species, FUN=function(x) c(NA, diff(x) ) ) )
str(iris_diff)
#--------------
List of 4
$ Sepal.Length: num [1:150] NA -0.2 -0.2 -0.1 0.4 ...
$ Sepal.Width : num [1:150] NA -0.5 0.2 -0.1 0.5 0.3 -0.5 0 -0.5 0.2 ...
$ Petal.Length: num [1:150] NA 0 -0.1 0.2 -0.1 ...
$ Petal.Width : num [1:150] NA 0 0 0 0 0.2 -0.1 -0.1 0 -0.1 ...
If you wanted that as a dataframe just wrap data.frame
around it. And it would be a good idea to include the original grouping vector:
iris_diff <- data.frame( Species= iris_species, iris_diff)
str(iris_diff)
#------
'data.frame': 150 obs. of 5 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Sepal.Length: num NA -0.2 -0.2 -0.1 0.4 ...
$ Sepal.Width : num NA -0.5 0.2 -0.1 0.5 0.3 -0.5 0 -0.5 0.2 ...
$ Petal.Length: num NA 0 -0.1 0.2 -0.1 ...
$ Petal.Width : num NA 0 0 0 0 0.2 -0.1 -0.1 0 -0.1 ...
Upvotes: 1
Reputation: 118789
The best solution I could think of in this case is data.table
:
require(data.table)
dt <- data.table(iris, key="Species")
dt.out <- dt[, lapply(.SD, diff), by=Species]
And if you want a plyr
solution, then the idea is basically the same. Split by Species
and apply diff
to each column.
require(plyr)
ddply(iris, .(Species), function(x) do.call(cbind, lapply(x[,1:4], diff)))
Upvotes: 2
Reputation: 42639
As you might know, aggregate
works on one column at a time. A single value is expected, and odd things happen if you return vectors of length different from 1.
You can split this up with by
to get the data (with fewer rows than in iris
) and put it back together:
b <- by(iris_metrics, iris_species, FUN=function(x) diff(as.matrix(x)))
do.call(rbind, lapply(names(b), function(x) data.frame(Species=x, b[[x]])))
diff(as.matrix)
is used as this does what you want for matrices (but not for data frames). The key point is that the function returns a different number of rows than are in each Species
in iris
.
Upvotes: 2