drmariod

Reputation: 11762

using regex in ddply variables

I am trying to use ddply on several columns selected by a regular expression, but I could not get it to work. I prepared a small example below. Is there a way to use ddply on several variables at once, or did I just miss something in the manual?

df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))
ddply(df,.(N), summarise, low=mean("low.."), high=mean("high.."))

Upvotes: 1

Views: 255

Answers (4)

Cath

Reputation: 24074

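You can do something like this:

ddply(df, .(N), summarise, 
      # grep() returns the matching column names; get() looks each one up within the current group
      low  = mean(sapply(grep("low",  colnames(df), value = TRUE), function(x) get(x))), 
      high = mean(sapply(grep("high", colnames(df), value = TRUE), function(x) get(x))))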

which gives this output :

  N         low        high
1 1  0.94613752  1.47197645
2 2 -0.68887596 -0.05779876
3 3 -0.28589753 -0.55694341
4 4 -0.01378869  0.28204629
5 5 -0.08681600  0.88544497

data :

> dput(df)
structure(list(low_1 = c(0.885675347945903, -1.30343272566325, -2.44201300062675, -1.27709377574332, -0.794159839824383), 
               low_2 = c(1.00659968581264,-0.0743191876393787, 1.87021794472605, 1.24951638739919, 0.620527846366092), 
               high_1 = c(0.630374573470948, 0.169009703225843, -0.573629421621814, 0.340752780334754, 0.417022085050569), 
               high_2 = c(2.31357832822303,-0.284607218026423, -0.540257400090053, 0.223339795927736, 1.35386785598766), 
               N = c(1, 2, 3, 4, 5)), 
               .Names = c("low_1", "low_2", "high_1", "high_2", "N"), 
               row.names = c(NA, -5L), class = "data.frame")
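
As a variation on the same idea, here is a minimal sketch that avoids get() by passing a function to ddply and indexing each group by the matched column names. The example values, the anchored patterns "^low_" and "^high_", and the names low_cols, high_cols and res are assumptions for illustration, not part of the answer above.

```r
library(plyr)

# Small example data with the same layout as in the question (values are arbitrary).
df <- data.frame(low_1 = c(1, 2, 3), low_2 = c(3, 4, 5),
                 high_1 = c(10, 20, 30), high_2 = c(30, 40, 50),
                 N = c(1, 2, 3))

# Select the column names once; "^" anchors the match to the start of the name.
low_cols  <- grep("^low_",  names(df), value = TRUE)
high_cols <- grep("^high_", names(df), value = TRUE)

# For each N, pool all values from the matched columns and average them.
res <- ddply(df, .(N), function(d) {
  data.frame(low  = mean(unlist(d[low_cols])),
             high = mean(unlist(d[high_cols])))
})
res
```

Selecting the names outside of ddply keeps the per-group function short and makes it easy to check which columns the pattern actually picked up.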

Upvotes: 0

talat

Reputation: 70286

Here's an approach with dplyr and tidyr that I think results in the desired output:

require(dplyr) # if not yet installed, first run: install.packages("dplyr")
require(tidyr) # if not yet installed, first run: install.packages("tidyr")

gather(df, group, val, -N) %>%     # reshape the data to long format
  mutate(group = gsub("_\\d+$", "", group)) %>%   # delete the numeric suffix from low_x and high_x in the "group" column
  group_by(N, group) %>%           # group the data based on N and group (low/high)
  summarise(val = mean(val)) %>%   # apply the mean
  ungroup() %>%                    # ungroup the data
  spread(group, val)               # reshape to wide format so that low and high are separate columns

#Source: local data frame [5 x 3]
#
#  N        high         low
#1 1  0.29702057  0.15541153
#2 2 -1.02057669  1.09399446
#3 3  0.20745563  0.11582517
#4 4 -0.05573833 -0.22570064
#5 5  0.61697307 -0.06831203

It will work with any number of low_X and high_X columns.
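
To see why the number of columns does not matter, here is a small sketch of what the suffix-stripping step does on its own. The column names in cols are made up for illustration, including a two-digit suffix.

```r
# Hypothetical column names, including a two-digit suffix and the grouping column.
cols <- c("low_1", "low_2", "high_1", "high_10", "N")

# Remove a trailing underscore followed by one or more digits;
# names without such a suffix (here "N") are left unchanged.
groups <- gsub("_\\d+$", "", cols)
groups
```

Every low_X name collapses to "low" and every high_X name to "high", so the later group_by step only ever sees those two group labels.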

Note: make sure you load dplyr after plyr to avoid function name conflicts.

data

set.seed(4711)
df <- data.frame(low_1=rnorm(5),low_2=rnorm(5),high_1=rnorm(5),high_2=rnorm(5),N=c(1,2,3,4,5))

Upvotes: 0

Andrie

Reputation: 179448

You can use colwise to calculate the same statistic on multiple columns, for example:

ddply(df, .(N), colwise(mean))

  N      low_1      low_2     high_1      high_2
1 1 -1.3105923 -0.5507862  0.6304232 -0.04553457
2 2 -0.1586676  0.6820199 -0.8220206  0.93301381
3 3  0.4434761  0.4337073 -1.2988521  0.84412693
4 4  0.2522467 -0.1393690  0.2361361  1.64288051
5 5  0.4118032  0.4358705 -0.3529169  0.98916518

To use a regular expression on the column names, you can do something like the following:

  1. Use a regular expression with grep() to identify all columns you're interested in.
  2. Extract the column number of the grouping variable.
  3. Pass a subset of the data to ddply, where the subset consists of only those columns identified in steps 1 and 2.

Try this:

idx <- grep("low", names(df))
idk <- which(names(df) == "N")
ddply(df[, c(idx, idk)], .(N), colwise(mean))

  N      low_1      low_2
1 1 -1.3105923 -0.5507862
2 2 -0.1586676  0.6820199
3 3  0.4434761  0.4337073
4 4  0.2522467 -0.1393690
5 5  0.4118032  0.4358705
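
If a single low and a single high value per N is wanted, as in the question, one possible follow-up is to average the per-column means returned by colwise. This is only a sketch: the example data and the names col_means and pooled are made up, and averaging the column means equals the pooled mean only when every matched column contributes the same number of non-missing values per group.

```r
library(plyr)

# Example data with two rows per N (values are arbitrary).
df <- data.frame(low_1 = c(1, 2, 3, 5), low_2 = c(3, 4, 5, 7),
                 high_1 = c(10, 20, 30, 50), high_2 = c(30, 40, 50, 70),
                 N = c(1, 1, 2, 2))

# Per-column means for each N, as in the colwise() call above.
col_means <- ddply(df, .(N), colwise(mean))

# Average across the columns whose names match each pattern.
pooled <- data.frame(
  N    = col_means$N,
  low  = rowMeans(col_means[grep("^low_",  names(col_means))]),
  high = rowMeans(col_means[grep("^high_", names(col_means))])
)
pooled
```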

Upvotes: 1

Richie Cotton

Reputation: 121077

As it stands, you need to pass a different argument for each statistic that you are calculating.

ddply(
  df,
  .(N), 
  summarise, 
  low_1  = mean(low_1), 
  low_2  = mean(low_2), 
  high_1 = mean(high_1), 
  high_2 = mean(high_2)
)

The idiomatic way of calculating this is to reshape your data to long format before calculating the stats.

library(plyr)
library(reshape2)
library(stringr)
df_long <- melt(df, id.vars = "N")
matches <- str_match(df_long$variable, "(low|high)_([[:digit:]])")
df_long <- within(
  df_long,
  {
    height <- matches[, 2]
    group <- as.integer(matches[, 3])
  }
)
ddply(
  df_long,
  .(N, height, group), 
  summarize, 
  mean_value = mean(value)
)

If you prefer, you can use mutate rather than within, and the call to ddply can be replaced with modern dplyr syntax.

library(dplyr)  # load after plyr so that dplyr's summarize masks plyr's
df_long %>%
  group_by(N, height, group) %>%
  summarize(mean_value = mean(value))
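
If the aim is one low and one high mean per N, as in the question, a possible sketch is to drop the digit from the grouping and let dcast from reshape2 do the aggregation while reshaping back to wide format. The example values and the name wide are assumptions for illustration.

```r
library(reshape2)

# Example data with the same layout as in the question (values are arbitrary).
df <- data.frame(low_1 = c(1, 2), low_2 = c(3, 4),
                 high_1 = c(10, 20), high_2 = c(30, 40),
                 N = c(1, 2))

# Long format: one row per (N, original column) pair.
df_long <- melt(df, id.vars = "N")

# Keep only the "low"/"high" part of the original column name.
df_long$height <- sub("_[[:digit:]]+$", "", as.character(df_long$variable))

# Back to wide format, averaging all values that share the same N and height.
wide <- dcast(df_long, N ~ height, value.var = "value", fun.aggregate = mean)
wide
```

Grouping by height alone pools low_1 and low_2 (and likewise the high columns), whereas grouping by both height and group keeps one mean per original column.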

Upvotes: 0
