Reputation: 183
I have a data set with 4 different treatments listed in one column. They are treatment 1, treatment 2, treatment 3, and treatment 4 with the results displayed next to it in another column. I am trying to cast the data so I can run an anova to see which treatment was most effective. But when I try using the cast function I get back an error message. How do I sepperate the treatments into 4 separate columns with how effective they were in the column next to it. There is also another column with the observation number listed.
Upvotes: 0
Views: 108
Reputation: 4732
I have the impression you try to calculate dummy coded columns as described here. Most regression functions do this automatically if you declare the treatment as factor.
If you need to do this manually the psych library provides the function dummy code (see).
Example (with the test data stolen from @jalapic):
df <- data.frame(group = rep(LETTERS[1:4], 4), scores = runif(16))
library(psych)
cbind(dummy.code(df$group), df)
gives:
A B C D group scores
1 1 0 0 0 A 0.73130714
2 0 1 0 0 B 0.60615966
3 0 0 1 0 C 0.74046980
4 0 0 0 1 D 0.13636377
5 1 0 0 0 A 0.33951820
6 0 1 0 0 B 0.95789481
7 0 0 1 0 C 0.26756629
8 0 0 0 1 D 0.71264870
9 1 0 0 0 A 0.87482811
10 0 1 0 0 B 0.69434779
11 0 0 1 0 C 0.92821292
12 0 0 0 1 D 0.70466255
13 1 0 0 0 A 0.30458528
14 0 1 0 0 B 0.90092407
15 0 0 1 0 C 0.77997403
16 0 0 0 1 D 0.05661558
Upvotes: 0
Reputation: 35314
You can do this in base R with the reshape()
function. Demo:
set.seed(1);
df <- data.frame(obs=rep(1:3,each=4), treatment=rep(1:4,3), effect=rnorm(12,50,20) );
df;
## obs treatment effect
## 1 1 1 37.47092
## 2 1 2 53.67287
## 3 1 3 33.28743
## 4 1 4 81.90562
## 5 2 1 56.59016
## 6 2 2 33.59063
## 7 2 3 59.74858
## 8 2 4 64.76649
## 9 3 1 61.51563
## 10 3 2 43.89223
## 11 3 3 80.23562
## 12 3 4 57.79686
reshape(df,dir='w',idvar='obs',timevar='treatment');
## obs effect.1 effect.2 effect.3 effect.4
## 1 1 37.47092 53.67287 33.28743 81.90562
## 5 2 56.59016 33.59063 59.74858 64.76649
## 9 3 61.51563 43.89223 80.23562 57.79686
The problem you're having is that your OBS
column is unique per long-format record. reshape()
depends on being able to use the idvar
to identify when different long-format records should be combined into a single wide-format record. As you can see in my example data above, the four long-format records with obs=1
are combined into a single wide-format record, and ditto for obs=2
and obs=3
.
To put this another way, an id column that is unique in the long-format is useless for reshape()
when it is trying to decide how to transform the data.frame to wide format. In fact, there's simply no way to retain your OBS
data in the wide-format, unless you create parallel wide columns with those key values, but I don't think you want to do that.
This problem can be solved by recomputing the OBS
column to be unique per wide-format record. This can be done using transform()
and ave()
and grouping by treatment
, so that each unique treatment
gets its own observation sequence. Demo:
set.seed(2);
df <- data.frame(obs=1:12, treatment=rep(1:4,3), effect=rnorm(12,50,20) );
df;
## obs treatment effect
## 1 1 1 32.06171
## 2 2 2 53.69698
## 3 3 3 81.75691
## 4 4 4 27.39249
## 5 5 1 48.39496
## 6 6 2 52.64841
## 7 7 3 64.15909
## 8 8 4 45.20604
## 9 9 1 89.68948
## 10 10 2 47.22426
## 11 11 3 58.35302
## 12 12 4 69.63506
reshape(transform(df,obs=ave(1:nrow(df),df$treatment,FUN=seq_along)),dir='w',idvar='obs',timevar='treatment');
## obs effect.1 effect.2 effect.3 effect.4
## 1 1 32.06171 53.69698 81.75691 27.39249
## 5 2 48.39496 52.64841 64.15909 45.20604
## 9 3 89.68948 47.22426 58.35302 69.63506
Through some deductive reverse-engineering, I think I've been able to decipher the kind of format your data is in, based on what you pasted into the comments. The following demonstrates the solution I presented above for synthesized data which I think closely resembles your own:
set.seed(33);
df <- data.frame(OBS=1:12, Treatment=c('PyrI','PyrII','Keto','Placebo')[rep(1:4,each=3)], Flaking=as.integer(rnorm(12,16,2)) );
df;
## OBS Treatment Flaking
## 1 1 PyrI 15
## 2 2 PyrI 15
## 3 3 PyrI 18
## 4 4 PyrII 15
## 5 5 PyrII 11
## 6 6 PyrII 16
## 7 7 Keto 14
## 8 8 Keto 17
## 9 9 Keto 17
## 10 10 Placebo 13
## 11 11 Placebo 16
## 12 12 Placebo 15
reshape(transform(df,OBS=ave(1:nrow(df),df$Treatment,FUN=seq_along)),dir='w',idvar='OBS',timevar='Treatment');
## OBS Flaking.PyrI Flaking.PyrII Flaking.Keto Flaking.Placebo
## 1 1 15 15 14 13
## 2 2 15 11 17 16
## 3 3 18 16 17 15
Upvotes: 1
Reputation: 14202
I'd do something like this,
your data:
df <- data.frame(group = rep(LETTERS[1:4], 4), scores = runif(16))
Assuming your data is two columns, one column specifying the group and another with the values in it. What the first bit of the below does is to count the occurrences of each group and then it spreads the data accordingly.
library(tidyr)
library(dplyr)
df %>% group_by(group) %>% mutate(count = row_number()) %>% spread(group, scores)
Result:
count A B C D
1 1 0.6211185 0.2278333 0.68751500 0.03104363
2 2 0.4507366 0.1834150 0.05700584 0.18217047
3 3 0.3411344 0.4382062 0.23057716 0.85572707
4 4 0.8514964 0.8386036 0.95319578 0.76257128
your values will vary because these are randomly generated and I forgot to set.seed
. If you have uneven number of cases per group, it will fill the empty cells with NAs.
Upvotes: 0