Kevin Wang
Kevin Wang

Reputation: 41

Using reshape() function in R -- from wide to long

I'm trying to re-arrange the data in R from something like:

Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H)
1,5,20/08/2011,21/08/2011,1200,1300
2,6,21/08/2011,22/08/2011,1300,1400
3,7,22/08/2011,23/08/2011,1400,1500
4,8,23/08/2011,24/08/2011,1500,1600

to something like:

Record Type,Patient ID,Episode Number,Admission Date,Admission Time
H,1,5,20/08/2011,1200
A,1,5,21/08/2011,1300
H,2,6,21/08/2011,1300
A,2,6,22/08/2011,1400
H,3,7,22/08/2011,1400
A,3,7,23/08/2011,1500
H,4,8,23/08/2011,1500
A,4,8,24/08/2011,1600

(I've used CSV format so it's easier to to use them as test data).

I tried the reshape() function and it kind of worked:

> reshape(foo, direction = "long", idvar = 1, varying = 3:dim(foo)[2], 
> sep = "..", timevar = "dataset")
     Patient.ID Episode.Number dataset Admission.Date Admission.Time
1.A.          1              5      A.     20/08/2011           1200
2.A.          2              6      A.     21/08/2011           1300
3.A.          3              7      A.     22/08/2011           1400
4.A.          4              8      A.     23/08/2011           1500
1.H.          1              5      H.     21/08/2011           1300
2.H.          2              6      H.     22/08/2011           1400
3.H.          3              7      H.     23/08/2011           1500
4.H.          4              8      H.     24/08/2011           1600

But it's not in the exact format I wanted (I want for each "Patient ID", the first row is "H" and second row is "A").

In addition, when I extend this to the read data (which has 250+ columns) it failed:

> reshape(realdata, direction = "long", idvar = 1, varying = 
> 6:dim(foo)[2], sep = "..", timevar = "dataset")
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
  'varying' arguments must be the same length

I think partly because the colnames look like:

> colnames(foo)
  [1] "Unique.Key"                                    
  [2] "Campus.Code"                                   
  [3] "UR"                                            
  [4] "Terminal.digit"                                
  [5] "Admission.date..A."                      
  [6] "Admission.date..H."                     
  [7] "Admission.time..A."                      
  [8] "Admission.time..H."     
  .
  .
  .
 [31] "Medicare.Number"                               
 [32] "Payor"                                         
 [33] "Doctor.specialty"                              
 [34] "Clinic"     
  .
  .
  .
 [202] "Admission.Source..A."                    
 [203] "Admission.Source..H."  

i.e. there are "common columns" (without suffixes) between columns with suffixes (hope this makes sense).

Upvotes: 4

Views: 10575

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

The suggestions to use melt and cast (now dcast and family) from the "reshape" (now "reshape2") package won't get you to the form you are looking for with your data. In particular, you'll need to do some additional processing if your end goal is the "semi-long" format you describe.

There are two problems you raise in your question:

First is the ordering of the results. As @RichieCotton points out in his comment and @mac in his answer, a call to order() is enough to solve that problem.

Second is the error:

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
  'varying' arguments must be the same length

This is because, as you guessed, there are non-varying columns in your list of varying = 6:dim(foo)[2] selection.

One easy way to fix this is to use grep to identify which columns are varying and use that to specify your columns instead of using an (incorrect) catchall as you did. Here's a worked example:

set.seed(1)
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4], 
                  Admission.Date..A = 11:14, Admission.Date..H = 21:24,
                  Medicare.Number = letters[1:4], Payor = letters[1:4],
                  Admission.Source..A = rnorm(4), 
                  Admission.Source..H = rnorm(4))
foo
#   Unique.Key Campus.Code Admission.Date..A Admission.Date..H Medicare.Number
# 1          1           A                11                21               a
# 2          2           B                12                22               b
# 3          3           C                13                23               c
# 4          4           D                14                24               d
#   Payor Admission.Source..A Admission.Source..H
# 1     a          -0.6264538           0.3295078
# 2     b           0.1836433          -0.8204684
# 3     c          -0.8356286           0.4874291
# 4     d           1.5952808           0.7383247

Figure out which columns are varying and use that as your varying argument:

varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo))

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = varyingCols, sep = "..")
out[order(out$Unique.Key, rev(out$time)), ]
#     Unique.Key Campus.Code Medicare.Number Payor time Admission.Date Admission.Source
# 1.H          1           A               a     a    H             21        0.3295078
# 1.A          1           A               a     a    A             11       -0.6264538
# 2.H          2           B               b     b    H             22       -0.8204684
# 2.A          2           B               b     b    A             12        0.1836433
# 3.H          3           C               c     c    H             23        0.4874291
# 3.A          3           C               c     c    A             13       -0.8356286
# 4.H          4           D               d     d    H             24        0.7383247
# 4.A          4           D               d     d    A             14        1.5952808

If your data are small (not many columns), you can manually count the location of the varying columns and specify the vector. As you've already noticed, any columns not specified in idvar or varying get recycled appropriately.

out <- reshape(foo, direction = "long", idvar = "Unique.Key", 
               varying = c(3, 4, 7, 8), sep = "..")

Upvotes: 2

mac
mac

Reputation: 3554

You might be able to get what you're after by using melt and cast or reshape, but you're looking for something pretty specific, so it might be simpler to do the reshaping directly. You can subset the original data into two separate data frames, (one for A, one for H) then glue them back together.

The code below works on your sample data, but i also tried to write it flexibly enough so that it will hopefully work on your larger data set, as long as the columns are named consistently with the ..A. and ..H. suffixes.

#grab the common columns and the "A" columns
#(by using grepl to find any column that doesn't end in ".H.")
foo.a <- foo[,!grepl(x=colnames(foo),pattern = "\\.H\\.$")]

#strip the "..A." from the end of the ".A." column names
colnames(foo.a) <- sub(x=colnames(foo.a),
                   pattern="(.*)\\.\\.A\\.$",
                   rep = "\\1")
foo.a$Record.Type <- "A"

#grab the common columns and the "H" columns
#(by using grepl to find any column that doesn't end in ".A.")
foo.h <- foo[,!grepl(x=colnames(foo),pattern = "\\.A\\.$")]

#strip the "..H." from the end of the "..H." column names
colnames(foo.h) <- sub(x=colnames(foo.h),
                   pattern="(.*)\\.\\.H\\.$",
                   rep = "\\1")
foo.h$Record.Type <- "H"

#stick them back together
new.foo <- rbind(foo.a,foo.h)

#order by Patient.ID
new.foo <- new.foo[with(new.foo,order(Patient.ID)),]

#re-order the columns as you like
new.foo <- new.foo[,c(1,2,5,3,4)]

This gives me:

> new.foo  
   Patient.ID Episode.Number Record.Type Admission.Date Admission.Time  
1          1              5           A     20/08/2011           1200  
5          1              5           H     21/08/2011           1300  
2          2              6           A     21/08/2011           1300  
6          2              6           H     22/08/2011           1400  
3          3              7           A     22/08/2011           1400  
7          3              7           H     23/08/2011           1500  
4          4              8           A     23/08/2011           1500  
8          4              8           H     24/08/2011           1600  

Upvotes: 0

Related Questions