Hotamd6
Hotamd6

Reputation: 325

Transpose / reshape dataframe without "timevar" from long to wide format

I have a data frame that follows the below long Pattern:

   Name          MedName
  Name1    atenolol 25mg
  Name1     aspirin 81mg
  Name1 sildenafil 100mg
  Name2    atenolol 50mg
  Name2   enalapril 20mg

And would like to get below (I do not care if I can get the columns to be named this way, just want the data in this format):

   Name   medication1    medication2      medication3
  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
  Name2 atenolol 50mg enalapril 20mg             NA

Through this very site I have become familiarish with the reshape/reshape2 package, and have went through several attempts to try to get this to work but have thus far failed.

When I try dcast(dataframe, Name ~ MedName, value.var='MedName') I just get a bunch of columns that are flags of the medication names (values that get transposed are 1 or 0) example:

 Name  atenolol 25mg  aspirin 81mg
Name1              1             1
Name2              0             0 

I also tried a dcast(dataset, Name ~ variable) after I melted the dataset, however this just spits out the following (just counts how many meds each person has):

 Name  MedName
Name1        3
name2        2

Finally, I tried to melt the data and then reshape using idvar="Name" timevar="variable" (of which all just are Mednames), however this does not seem built for my issue since if there are multiple matches to the idvar, the reshape just takes the first MedName and ignores the rest.

Does anyone know how to do this using reshape or another R function? I realize that there probably is a way to do this in a more messy manner with some for loops and conditionals to basically split and re-paste the data, but I was hoping there was a more simple solution. Thank you so much!

Upvotes: 31

Views: 9035

Answers (9)

GKi
GKi

Reputation: 39657

In base you can use split or unstack to split up MedName by Name. Then set the length of each list element to the same length and rbind them.

. <- split(x$MedName, x$Name)
#. <- unstack(x[2:1]) #Alternative

do.call(rbind, lapply(., `length<-`, max(lengths(.))))
#do.call(rbind, lapply(., "[", 1:max(lengths(.)))) #Alternative
#t(sapply(., "[", 1:max(lengths(.)))) #Alternative

#      [,1]            [,2]             [,3]              
#Name1 "atenolol 25mg" "aspirin 81mg"   "sildenafil 100mg"
#Name2 "atenolol 50mg" "enalapril 20mg" NA                

Note that the result is a matrix. In case a data.frame is needed use in addition as.data.frame.matrix.

Data

x <- read.table(header=TRUE, text="
  Name          MedName
  Name1    'atenolol 25mg'
  Name1     'aspirin 81mg'
  Name1 'sildenafil 100mg'
  Name2    'atenolol 50mg'
  Name2   'enalapril 20mg'")

Benchmark

bench::mark(check=FALSE,
Jaap1 = dcast(setDT(y), Name ~ rowid(Name, prefix = "medication"), value.var = "MedName"),
Jaap2 = x %>%  group_by(Name) %>%  mutate(rn = paste0("medication",row_number())) %>% spread(rn, MedName),
mnel = {data_with_index <- ddply(x, .(Name), mutate, index = paste0('medication', 1:length(Name)))    
  dcast(setDT(data_with_index), Name ~ index, value.var = 'MedName') },
thelatemail = reshape(within(x, uniqid <- ave(as.character(Name), Name, FUN = seq_along)), idvar = "Name", timevar = "uniqid", direction = "wide"),
a5c1d2h2i1m1n2o1r2t1 = dcast.data.table(getanID(y, "Name"), Name ~ .id, value.var = "MedName"),
"Anthony Damico" = {. <- x[order(x[, "Name"]),]
  .$time <- unlist( lapply( rle( as.character( .[ , "Name" ] ) )$lengths , seq_len ) )
  reshape( . , idvar = "Name" , direction = 'wide' ) },
"Ric S" = x %>%  group_by(Name) %>%  mutate(row_n = row_number()) %>% 
  pivot_wider(id_cols = Name, names_from = row_n, values_from = MedName, names_glue = "medication{row_n}"),
"Darren Tsai" = x %>% chop(-Name) %>% unnest_wider(MedName, names_sep = ""),
moodymudskipper = x %>% group_by(Name) %>% do(as_tibble(t(unlist(.[2])))),
GKi = {. <- split(x$MedName, x$Name)
  do.call(rbind, lapply(., "[", 1:max(lengths(.)))) },
GKi2 = {. <- split(x$MedName, x$Name)
  do.call(rbind, lapply(., `length<-`, max(lengths(.))))}
)

Result

   expression                min   median `itr/sec` mem_al…¹ gc/se…² n_itr  n_gc
   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:by>   <dbl> <int> <dbl>
 1 Jaap1                910.83µs    1.5ms     645.   274.5KB    8.32   310     4
 2 Jaap2                  9.64ms  11.53ms      86.9   16.5KB    9.15    38     4
 3 mnel                   4.69ms   5.91ms     164.   280.1KB    6.31    78     3
 4 thelatemail          619.74µs 898.36µs     959.        0B   12.6    458     6
 5 a5c1d2h2i1m1n2o1r2t1   1.36ms   2.18ms     426.   291.4KB    8.70   196     4
 6 Anthony Damico       565.34µs 794.76µs    1123.        0B   12.5    537     6
 7 Ric S                 13.51ms  15.59ms      62.4   64.2KB    6.69    28     3
 8 Darren Tsai            3.75ms   5.02ms     199.    22.7KB    8.54    93     4
 9 moodymudskipper       12.88ms  15.88ms      63.5   26.5KB    6.81    28     3
10 GKi                   31.58µs  35.73µs   21448.        0B   17.2   9992     8
11 GKi2                  29.41µs  32.77µs   23717.        0B   16.6   9993     7

GKi is in this case about 15 times faster than the second and is in the group which is allocating no additional memory.

Upvotes: 1

Darren Tsai
Darren Tsai

Reputation: 35554

A tidyr solution with chop() and unnest_wider().

library(tidyr)

df %>%
  chop(-Name) %>%
  unnest_wider(MedName, names_sep = "")

# # A tibble: 2 x 4
#   Name  MedName1      MedName2       MedName3        
#   <chr> <chr>         <chr>          <chr>           
# 1 Name1 atenolol 25mg aspirin 81mg   sildenafil 100mg
# 2 Name2 atenolol 50mg enalapril 20mg NA 

The argument names_sep = "" is necessary; otherwise, the new column names will be ..1, ..2, and ..3.


Data

df <- structure(list(Name = c("Name1", "Name1", "Name1", "Name2", "Name2"
), MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg", 
"atenolol 50mg", "enalapril 20mg")), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 1

Ric S
Ric S

Reputation: 9247

One clean solution involves the very useful pivot_wider function from the tidyr package version 1.1.0. With this you can also directly specify the column names by using the argument names_glue.

library(tidyr)
library(dplyr)

dataframe %>% 
  group_by(Name) %>% 
  mutate(row_n = row_number()) %>% 
  pivot_wider(id_cols = Name, names_from = row_n, values_from = MedName, names_glue = "medication{row_n}")

Output

# A tibble: 2 x 4
# Groups:   Name [2]
#   Name  medication1   medication2    medication3     
#   <chr> <chr>         <chr>          <chr>           
# 1 Name1 atenolol 25mg aspirin 81mg   sildenafil 100mg
# 2 Name2 atenolol 50mg enalapril 20mg NA  

Upvotes: 5

Jaap
Jaap

Reputation: 83215

With the package, this could easily be solved with the new rowid function:

library(data.table)
dcast(setDT(d1), 
      Name ~ rowid(Name, prefix = "medication"), 
      value.var = "MedName")

which gives:

   Name    medication1     medication2       medication3
1 Name1  atenolol 25mg    aspirin 81mg  sildenafil 100mg
2 Name2  atenolol 50mg  enalapril 20mg              <NA>

Another method (commonly used before version 1.9.7):

dcast(setDT(d1)[, rn := 1:.N, by = Name], 
      Name ~ paste0("medication",rn), 
      value.var = "MedName")

giving the same result.


A similar approach, but now using the and packages:

library(dplyr)
library(tidyr)
d1 %>%
  group_by(Name) %>%
  mutate(rn = paste0("medication",row_number())) %>%
  spread(rn, MedName)

which gives:

Source: local data frame [2 x 4]
Groups: Name [2]

    Name   medication1    medication2      medication3
  (fctr)         (chr)          (chr)            (chr)
1  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2  Name2 atenolol 50mg enalapril 20mg               NA

Upvotes: 18

mnel
mnel

Reputation: 115382

Assuming your data is in the object dataset:

library(plyr)
## Add a medication index
data_with_index <- ddply(dataset, .(Name), mutate, 
                         index = paste0('medication', 1:length(Name)))    
dcast(data_with_index, Name ~ index, value.var = 'MedName')

##    Name   medication1    medication2      medication3
## 1 Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
## 2 Name2 atenolol 50mg enalapril 20mg             <NA>

Upvotes: 16

moodymudskipper
moodymudskipper

Reputation: 47300

Here's a shorter way, taking advantage of the way unlist deals with names:

library(dplyr)
df1 %>% group_by(Name) %>% do(as_tibble(t(unlist(.[2]))))
# # A tibble: 2 x 4
# # Groups:   Name [2]
#      Name      MedName1       MedName2         MedName3
#     <chr>         <chr>          <chr>            <chr>
#   1 name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
#   2 name2 atenolol 50mg enalapril 20mg             <NA>

Upvotes: 1

Anthony Damico
Anthony Damico

Reputation: 6104

@thelatemail's solution is similar to this one. When I generate the time variable, I use rle in case I'm not working interactively and the Name variable needs to be dynamic.

# start with your example data
x <- 
    data.frame(
        Name=c(rep("name1",3),rep("name2",2)),
        MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
            "atenolol 50mg","enalapril 20mg")
    )

# pick the id variable
id <- 'Name'

# sort the data.frame by that variable
x <- x[ order( x[ , id ] ) , ]

# construct a `time` variable on the fly
x$time <- unlist( lapply( rle( as.character( x[ , id ] ) )$lengths , seq_len ) )

# `reshape` uses that new `time` column by default
y <- reshape( x , idvar = id , direction = 'wide' )

# done
y

Upvotes: 5

thelatemail
thelatemail

Reputation: 93813

You could always generate a unique timevar before using reshape. Here I use ave to apply the function seq_along 'along' each "Name".

test <- data.frame(
Name=c(rep("name1",3),rep("name2",2)),
MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
          "atenolol 50mg","enalapril 20mg")
)

# generate the 'timevar'
test$uniqid <- with(test, ave(as.character(Name), Name, FUN = seq_along))

# reshape!
reshape(test, idvar = "Name", timevar = "uniqid", direction = "wide")

Result:

   Name     MedName.1      MedName.2        MedName.3
1 name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
4 name2 atenolol 50mg enalapril 20mg             <NA>

Upvotes: 15

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

This seems to actually be a fairly common problem, so I have included a function called getanID in my "splitstackshape" package.

Here's what it does:

library(splitstackshape)
getanID(test, "Name")
#     Name          MedName .id
# 1: name1    atenolol 25mg   1
# 2: name1     aspirin 81mg   2
# 3: name1 sildenafil 100mg   3
# 4: name2    atenolol 50mg   1
# 5: name2   enalapril 20mg   2

Since "data.table" is loaded along with "splitstackshape", you have access to dcast.data.table, so you can proceed as with @mnel's example.

dcast.data.table(getanID(test, "Name"), Name ~ .id, value.var = "MedName")
#     Name             1              2                3
# 1: name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
# 2: name2 atenolol 50mg enalapril 20mg               NA

The function essentially implements a sequence(.N) by the groups identified to create the "time" column.

Upvotes: 11

Related Questions