Greg Sullivan
Greg Sullivan

Reputation: 183

Casting data in R

I have a data set with 4 different treatments listed in one column. They are treatment 1, treatment 2, treatment 3, and treatment 4 with the results displayed next to it in another column. I am trying to cast the data so I can run an anova to see which treatment was most effective. But when I try using the cast function I get back an error message. How do I sepperate the treatments into 4 separate columns with how effective they were in the column next to it. There is also another column with the observation number listed.

Upvotes: 0

Views: 108

Answers (3)

bdecaf
bdecaf

Reputation: 4732

I have the impression you try to calculate dummy coded columns as described here. Most regression functions do this automatically if you declare the treatment as factor.

If you need to do this manually the psych library provides the function dummy code (see).

Example (with the test data stolen from @jalapic):

df <- data.frame(group = rep(LETTERS[1:4], 4), scores = runif(16))
library(psych)
cbind(dummy.code(df$group), df)

gives:

   A B C D group     scores
1  1 0 0 0     A 0.73130714
2  0 1 0 0     B 0.60615966
3  0 0 1 0     C 0.74046980
4  0 0 0 1     D 0.13636377
5  1 0 0 0     A 0.33951820
6  0 1 0 0     B 0.95789481
7  0 0 1 0     C 0.26756629
8  0 0 0 1     D 0.71264870
9  1 0 0 0     A 0.87482811
10 0 1 0 0     B 0.69434779
11 0 0 1 0     C 0.92821292
12 0 0 0 1     D 0.70466255
13 1 0 0 0     A 0.30458528
14 0 1 0 0     B 0.90092407
15 0 0 1 0     C 0.77997403
16 0 0 0 1     D 0.05661558

Upvotes: 0

bgoldst
bgoldst

Reputation: 35314

You can do this in base R with the reshape() function. Demo:

set.seed(1);
df <- data.frame(obs=rep(1:3,each=4), treatment=rep(1:4,3), effect=rnorm(12,50,20) );
df;
##    obs treatment   effect
## 1    1         1 37.47092
## 2    1         2 53.67287
## 3    1         3 33.28743
## 4    1         4 81.90562
## 5    2         1 56.59016
## 6    2         2 33.59063
## 7    2         3 59.74858
## 8    2         4 64.76649
## 9    3         1 61.51563
## 10   3         2 43.89223
## 11   3         3 80.23562
## 12   3         4 57.79686
reshape(df,dir='w',idvar='obs',timevar='treatment');
##   obs effect.1 effect.2 effect.3 effect.4
## 1   1 37.47092 53.67287 33.28743 81.90562
## 5   2 56.59016 33.59063 59.74858 64.76649
## 9   3 61.51563 43.89223 80.23562 57.79686

The problem you're having is that your OBS column is unique per long-format record. reshape() depends on being able to use the idvar to identify when different long-format records should be combined into a single wide-format record. As you can see in my example data above, the four long-format records with obs=1 are combined into a single wide-format record, and ditto for obs=2 and obs=3.

To put this another way, an id column that is unique in the long-format is useless for reshape() when it is trying to decide how to transform the data.frame to wide format. In fact, there's simply no way to retain your OBS data in the wide-format, unless you create parallel wide columns with those key values, but I don't think you want to do that.

This problem can be solved by recomputing the OBS column to be unique per wide-format record. This can be done using transform() and ave() and grouping by treatment, so that each unique treatment gets its own observation sequence. Demo:

set.seed(2);
df <- data.frame(obs=1:12, treatment=rep(1:4,3), effect=rnorm(12,50,20) );
df;
##    obs treatment   effect
## 1    1         1 32.06171
## 2    2         2 53.69698
## 3    3         3 81.75691
## 4    4         4 27.39249
## 5    5         1 48.39496
## 6    6         2 52.64841
## 7    7         3 64.15909
## 8    8         4 45.20604
## 9    9         1 89.68948
## 10  10         2 47.22426
## 11  11         3 58.35302
## 12  12         4 69.63506
reshape(transform(df,obs=ave(1:nrow(df),df$treatment,FUN=seq_along)),dir='w',idvar='obs',timevar='treatment');
##   obs effect.1 effect.2 effect.3 effect.4
## 1   1 32.06171 53.69698 81.75691 27.39249
## 5   2 48.39496 52.64841 64.15909 45.20604
## 9   3 89.68948 47.22426 58.35302 69.63506

Through some deductive reverse-engineering, I think I've been able to decipher the kind of format your data is in, based on what you pasted into the comments. The following demonstrates the solution I presented above for synthesized data which I think closely resembles your own:

set.seed(33);
df <- data.frame(OBS=1:12, Treatment=c('PyrI','PyrII','Keto','Placebo')[rep(1:4,each=3)], Flaking=as.integer(rnorm(12,16,2)) );
df;
##    OBS Treatment Flaking
## 1    1      PyrI      15
## 2    2      PyrI      15
## 3    3      PyrI      18
## 4    4     PyrII      15
## 5    5     PyrII      11
## 6    6     PyrII      16
## 7    7      Keto      14
## 8    8      Keto      17
## 9    9      Keto      17
## 10  10   Placebo      13
## 11  11   Placebo      16
## 12  12   Placebo      15
reshape(transform(df,OBS=ave(1:nrow(df),df$Treatment,FUN=seq_along)),dir='w',idvar='OBS',timevar='Treatment');
##   OBS Flaking.PyrI Flaking.PyrII Flaking.Keto Flaking.Placebo
## 1   1           15            15           14              13
## 2   2           15            11           17              16
## 3   3           18            16           17              15

Upvotes: 1

jalapic
jalapic

Reputation: 14202

I'd do something like this,

your data:

df <- data.frame(group = rep(LETTERS[1:4], 4), scores = runif(16))

Assuming your data is two columns, one column specifying the group and another with the values in it. What the first bit of the below does is to count the occurrences of each group and then it spreads the data accordingly.

library(tidyr)
library(dplyr)

df %>% group_by(group) %>% mutate(count = row_number())  %>% spread(group, scores) 

Result:

 count         A         B          C          D
1     1 0.6211185 0.2278333 0.68751500 0.03104363
2     2 0.4507366 0.1834150 0.05700584 0.18217047
3     3 0.3411344 0.4382062 0.23057716 0.85572707
4     4 0.8514964 0.8386036 0.95319578 0.76257128

your values will vary because these are randomly generated and I forgot to set.seed. If you have uneven number of cases per group, it will fill the empty cells with NAs.

Upvotes: 0

Related Questions