Reputation: 11762
I am trying to use ddply on several columns selected by a regular expression, but I could not get it to work. I prepared a small example below. Is there a way to use ddply on several variables at once, or did I miss something in the manual?
df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))
ddply(df,.(N), summarise, low=mean("low.."), high=mean("high.."))
Upvotes: 1
Views: 255
Reputation: 24074
You can do something like this:
ddply(df, .(N), summarise,
      low  = mean(sapply(grep("low",  colnames(df), value = TRUE), function(x) get(x))),
      high = mean(sapply(grep("high", colnames(df), value = TRUE), function(x) get(x))))
which gives this output:
  N         low        high
1 1  0.94613752  1.47197645
2 2 -0.68887596 -0.05779876
3 3 -0.28589753 -0.55694341
4 4 -0.01378869  0.28204629
5 5 -0.08681600  0.88544497
Data:
> dput(df)
structure(list(low_1 = c(0.885675347945903, -1.30343272566325, -2.44201300062675, -1.27709377574332, -0.794159839824383),
low_2 = c(1.00659968581264,-0.0743191876393787, 1.87021794472605, 1.24951638739919, 0.620527846366092),
high_1 = c(0.630374573470948, 0.169009703225843, -0.573629421621814, 0.340752780334754, 0.417022085050569),
high_2 = c(2.31357832822303,-0.284607218026423, -0.540257400090053, 0.223339795927736, 1.35386785598766),
N = c(1, 2, 3, 4, 5)),
.Names = c("low_1", "low_2", "high_1", "high_2", "N"),
row.names = c(NA, -5L), class = "data.frame")
Upvotes: 0
Reputation: 70286
Here's an approach with dplyr and tidyr that I think results in the desired output:
library(dplyr) # if not yet installed, first run: install.packages("dplyr")
library(tidyr) # if not yet installed, first run: install.packages("tidyr")
gather(df, group, val, -N) %>% # reshape the data to long format
  mutate(group = gsub("_\\d+$", "", group)) %>% # strip the "_<number>" suffix from low_x and high_x in the "group" column
  group_by(N, group) %>% # group the data based on N and group (low/high)
  summarise(val = mean(val)) %>% # apply the mean
  ungroup() %>% # ungroup the data
  spread(group, val) # reshape to wide format so that low and high are separate columns
#Source: local data frame [5 x 3]
#
# N high low
#1 1 0.29702057 0.15541153
#2 2 -1.02057669 1.09399446
#3 3 0.20745563 0.11582517
#4 4 -0.05573833 -0.22570064
#5 5 0.61697307 -0.06831203
It will work with any number of low_X and high_X columns.
Note: make sure you load dplyr after plyr to avoid function name conflicts.
The data used for the output above:
set.seed(4711)
df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))
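In current tidyr releases, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the pipeline above might look like the following. It is a sketch under assumptions: tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument), with a small toy data frame invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative only)
df <- data.frame(
  low_1  = c(1, 2, 3),
  low_2  = c(3, 4, 5),
  high_1 = c(10, 20, 30),
  high_2 = c(30, 40, 50),
  N      = c(1, 2, 3)
)

res <- df %>%
  # long format: one row per (N, original column name)
  pivot_longer(-N, names_to = "group", values_to = "val") %>%
  # strip the "_<number>" suffix so low_1 and low_2 both become "low"
  mutate(group = sub("_\\d+$", "", group)) %>%
  group_by(N, group) %>%
  # mean per N and group; .groups = "drop" returns an ungrouped result
  summarise(val = mean(val), .groups = "drop") %>%
  # back to wide format: one column per group
  pivot_wider(names_from = group, values_from = val)
res
```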
Upvotes: 0
Reputation: 179448
You can use colwise to calculate the same statistic on multiple columns, for example:
ddply(df, .(N), colwise(mean))
N low_1 low_2 high_1 high_2
1 1 -1.3105923 -0.5507862 0.6304232 -0.04553457
2 2 -0.1586676 0.6820199 -0.8220206 0.93301381
3 3 0.4434761 0.4337073 -1.2988521 0.84412693
4 4 0.2522467 -0.1393690 0.2361361 1.64288051
5 5 0.4118032 0.4358705 -0.3529169 0.98916518
To use a regular expression on the column names, you can do something like the following:
1. Use grep() to identify all columns you're interested in.
2. Identify the column containing your grouping variable.
3. Use ddply on a subset of your data, where the subset consists of only those columns identified in steps 1 and 2.
Try this:
idx <- grep("low", names(df))
idk <- which(names(df) == "N")
ddply(df[, c(idx, idk)], .(N), colwise(mean))
N low_1 low_2
1 1 -1.3105923 -0.5507862
2 2 -0.1586676 0.6820199
3 3 0.4434761 0.4337073
4 4 0.2522467 -0.1393690
5 5 0.4118032 0.4358705
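The three steps can also be wrapped in a small helper. This is only a sketch: mean_matching is a hypothetical name, not a plyr function, and the toy data frame is invented for illustration. It assumes plyr is installed and that the grouping column is passed by name as a character vector.

```r
library(plyr)

# Hypothetical helper (not part of plyr): per-group means of all columns
# whose names match `pattern`, grouped by the column(s) named in `by`
mean_matching <- function(data, pattern, by) {
  # steps 1 and 2: matching columns plus the grouping column(s)
  keep <- unique(c(grep(pattern, names(data), value = TRUE), by))
  # step 3: ddply on the subset, applying mean() column-wise
  ddply(data[, keep, drop = FALSE], by, colwise(mean))
}

# Toy data (illustrative only): two rows share N == 1
df <- data.frame(
  low_1  = c(1, 3, 5),
  low_2  = c(2, 4, 6),
  high_1 = c(10, 20, 30),
  N      = c(1, 1, 2)
)

res <- mean_matching(df, "^low_", "N")
res
```

Anchoring the pattern with ^ keeps it from matching unrelated columns that merely contain "low" somewhere in their name.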
Upvotes: 1
Reputation: 121077
As it stands, you need to pass a different argument for each statistic that you are calculating.
ddply(
df,
.(N),
summarise,
low_1 = mean(low_1),
low_2 = mean(low_2),
high_1 = mean(high_1),
high_2 = mean(high_2)
)
The idiomatic way of calculating this is to reshape your data to long format before calculating the stats.
library(plyr)
library(reshape2)
library(stringr)
df_long <- melt(df, id.vars = "N")
matches <- str_match(df_long$variable, "(low|high)_([[:digit:]])")
df_long <- within(
df_long,
{
height <- matches[, 2]
group <- as.integer(matches[, 3])
}
)
ddply(
df_long,
.(N, height, group),
summarize,
mean_value = mean(value)
)
If you prefer, you can use mutate rather than within, and the call to ddply can be replaced with modern dplyr syntax.
library(dplyr) # load after plyr so that summarize() refers to the dplyr version
df_long %>%
  group_by(N, height, group) %>%
  summarize(mean_value = mean(value))
Upvotes: 0