lukeg
lukeg

Reputation: 1357

combining datasets with known identity variable

So lets take the following data

set.seed(123)

A <- 1:10
age<- sample(20:50,10)
height <- sample(100:210,10)

df1 <- data.frame(A, age, height)


B <- c(1,1,1,2,2,3,3,5,5,5,5,8,8,9,10,10)
injury <- sample(letters[1:5],16, replace=T)

df2 <- data.frame(B, injury)

Now, we can merge the data using the following code:

df3 <- merge(df1, df2, by.x = "A", by.y = "B", all=T)

head(df3)
#   A age height injury
# 1 1  28    206      e
# 2 1  28    206      d
# 3 1  28    206      d
# 4 2  43    149      e
# 5 2  43    149      d
# 6 3  31    173      d

But what i want in the new data frame is the length of injury's as a level variable.

So the desired output should look like this:

enter image description here

So in this simple example we know that the max length of injury's is 4 per unique df2$B . So we need 4 new columns.

Must my data has an unknown number, so a code is needed to generate the correct, so something like

length(unique(df2$injury[df2$B]))

but that is also not correct syntax, as the output should equal 4

Upvotes: 1

Views: 65

Answers (2)

davechilders
davechilders

Reputation: 9133

If you want to accomplish this using the tidyr package, I found it necessary to create an index variable:

df3 %>%
  group_by(A) %>%
  mutate(ind = row_number()) %>%
  spread(ind, injury)

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193667

I don't know where the letters are coming from in your sample output, because there are none in the variables in your sample input, but you can try something like:

library(splitstackshape)
dcast.data.table(getanID(df3, c("A", "age")), A + age + height ~ 
    .id, value.var = "injury")
##      A age height  1  2  3  4
##  1:  1  28    206  4  3  3 NA
##  2:  2  43    149  4  3 NA NA
##  3:  3  31    173  3  3 NA NA
##  4:  4  44    161 NA NA NA NA
##  5:  5  45    111  3  2  1  4
##  6:  6  21    195 NA NA NA NA
##  7:  7  33    125 NA NA NA NA
##  8:  8  41    104  4  3 NA NA
##  9:  9  32    133  4 NA NA NA
## 10: 10  30    197  1  2 NA NA

This adds a secondary ID based on the first two columns and then spreads it to a wide format.

Upvotes: 2

Related Questions