beginneR
beginneR

Reputation: 3291

Reshape data using dcast?

I don't know if using dcast() is the right way, but I want to reshape the following data.frame:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c(14,14,16))
df
   x y  z
1 p1 a 14
2 p1 b 14
3 p2 a 16

so that it looks like this one:

df2 <- data.frame(x=c("p1","p2"),a=c(1,1),b=c(1,0),z=c(14,16))
   x a b  z
1 p1 1 1 14
2 p2 1 0 16

The variable y in df should be broken so that its elements are new variables, each dummy coded. All other variables (in this case just z) are equal for each person (p1,p2 etc.). The only variable where a specific person p has different values is y.
The reason I want this is because I need to merge this dataset with other ones by variable x. Thing is, it needs to be one row per person (p1,p2 etc).

Upvotes: 2

Views: 2275

Answers (4)

thelatemail
thelatemail

Reputation: 93813

This is almost a duplicate of a previous question, and the same basic answer I used there works again. No need for any external packages either.

aggregate(model.matrix(~ y - 1, data=df),df[c("x","z")],max)

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

To explain this, as it is a bit odd looking, the model.matrix call at its most basic returns a binary indicator variable for each unique value for each row of your data.frame, like so:

  ya yb
1  1  0
2  0  1
3  1  0

If you aggregate that intermediate result by your two id variables (x and z), you are then essentially acting on the initial data.frame of:

   x  z ya yb
1 p1 14  1  0
2 p1 14  0  1
3 p2 16  1  0

So if you take the max value of ya and yb within each combination of x and z, you basically do:

   x  z ya      yb
1 p1 14  1*max*  0
2 p1 14  0       1*max*

--collapse--

   x  z ya      yb
1 p1 14  1       1

...and repeat that for each unique x/z combination to give the final result:

   x  z ya yb
1 p1 14  1  1
2 p2 16  1  0

Things get a bit crazy to generalise this to more columns, but it can be done, courtesy of this question e.g.:

df <- data.frame(x=c("p1","p1","p2"),y=c("a","b","a"),z=c("14","15","16"))
intm <- model.matrix(~ y + z - 1, data=df,
                 contrasts.arg = sapply(df[2:3], contrasts, contrasts=FALSE))
aggregate(intm,df[c("x")],max)

   x ya yb z14 z15 z16
1 p1  1  1   1   1   0
2 p2  1  0   0   0   1

Upvotes: 2

santeko
santeko

Reputation: 510

I'm not sure much of this you have to do but if you need a way to automate it, I wrote this little function that might help:

First run dcast:

new = dcast(df, x+z~y, value.var="y")

Load into your R environment:

 # args to be passed: 
 # df is your dataframe 
 # cols is a list of format c("colname1", "colname2", ... , "colnameN")
    binarizeCols = function(df, cols){
      for(i in cols){
        column = which(colnames(df) == i)
        truthRow = is.na(df[,column])
        for(j in 1:length(truthRow)){
          if(truthRow[j] == FALSE){
            df[j,column] = 1
          }else{
             df[j,column] = 0
           }
        }
      }
      return(df)
    }

then run:

new = binarizeCols(new, c("a", "b"))

and you get:

     x  z  a  b
   1 p1 14 1  1 
   2 p2 16 1  0

not as fast as using _apply() but there's no hardcoding, you can enter any colnames you want (maybe you want to skip one in the middle?) and you dont create a new instance of your df. note: I use "=" instead of "<-" because I thought it was being phased out but they can be replaced if need be.

Upvotes: 1

IRTFM
IRTFM

Reputation: 263332

df <- data.frame(x=c("p1","p1","p2","p3"),
                 y=c("a","b","a","c"),
                 z=c(14,14,16,17))  # wanted larger test case.
new <- dcast(df, x+z~y, value.var="y")
new[3:5] <- sapply(lapply(new[3:5], '%in%', unique(df$y) ), as.numeric)
new
   x  z a b c
1 p1 14 1 1 0
2 p2 16 1 0 0
3 p3 17 0 0 1

First check for containment in a vector that summarizes the possible values to create columns of logical values. Then 'dummify' by taking as.numeric of those logical values.

Upvotes: 0

Ricardo Saporta
Ricardo Saporta

Reputation: 55350

The following works, but seems cumbersome.

df2 <- df
df2$y <- as.numeric(y)
df$y2 <- as.numeric(df$y)

df2 <- dcast(df, x+z~y, value.var="y2")

df2
   x  z a  b
1 p1 14 1  2
2 p2 16 1 NA

Upvotes: 2

Related Questions