Magnus
Magnus

Reputation: 760

What's the best way to reshape this data into wide format?

I'm trying to reshape my data into wide form, but I'm new to this, and so far the ordinary reshape function doesn't seem to work. When I try with the full data (this excerpt seems to work though) I get NA values for all of my values and the column names turn to strange numerical vectors.

My data looks like this, PNR is an id variable that's unique to each observation.

    PNR ZIPCODE SEL_CRITERION   QUAL_RATING QUEUENUMBER UPSEC_ID    UPSEC_COURSE_ID MARK    COURSEOFFERING_ID   ADMISSIONROUND_ID   RESULT  WITHIN_PROGRAM  SUMMA
1234567890  46395   HB      55  0   HRF         SV203       G   97116   HT2019      20  0   67.5
1234567890  46395   HB      55  0   HRF         EN200       VG  97116   HT2019      20  0   67.5
1234567890  46395   HB      55  0   HRF         MA200       VG  97116   HT2019      20  0   67.5
1234567890  46395   HB      55  0   HRF         <null>  <null>  97116   HT2019      20  0   67.5
2345678901  42332   B5      2645    0   3SB         EN1201      VG  97116   HT2019      20  0   70.5
2345678901  42332   B5      2645    0   3SB         MA1201      VG  97113   HT2019      20  0   70.5
2345678901  42332   B5      2645    0   2SM         SV1201      VG  97113   HT2019      20  0   70.5

I would like for it to look something like this:

PNR ZIPCODE      HB  B5   QUEUENUMBER UPSEC_ID SV203 EN200 MA200 <null> EN1201 MA1201 SV1201
1234567890 46395 95  NA    0           HRF      G     VG    VG    NA     NA     NA     NA
2345678901 42332 NA  1645  0           3SB      NA    NA    NA    NA     VG     VG     VG

Is there any way I can make this happen?

I've tried to use the ordinary reshape function, which fails miserably (though not necessarily on this small excerpt):

test<-reshape(HT2018, idvar="PNR",timevar=c("SEL_CRITERION", "UPSEC_COURSE_ID"), v.names=c("QUAL_RATING","MARK"), direction = "wide")

I've also tried the melt and cast functions from the reshape2 package, which returns a single row with three values (neither of them correct) though I might of course be doing something wrong:

test<-melt(HT2018, id="PNR")
test<-cast(test, QUAL_RATING + MARK ~ PNR)

structure(list(PNR = c(1234567890, 1234567890, 1234567890, 1234567890, 
2345678901, 2345678901, 2345678901), ZIPCODE = c(46395L, 46395L, 
46395L, 46395L, 42332L, 42332L, 42332L), SEL_CRITERION = structure(c(2L, 
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("B5", "HB   "), class = "factor"), 
    QUAL_RATING = c(55L, 55L, 55L, 55L, 2645L, 2645L, 2645L), 
    QUEUENUMBER = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), UPSEC_ID = structure(c(2L, 
    2L, 2L, 2L, 1L, 1L, 1L), .Label = c("3SB", "HRF"), class = "factor"), 
    UPSEC_COURSE_ID = structure(c(7L, 3L, 5L, 1L, 2L, 4L, 6L), .Label = c("<null>", 
    "EN1201     ", "EN200      ", "MA1201     ", "MA200      ", 
    "SV1201     ", "SV203      "), class = "factor"), MARK = structure(c(2L, 
    3L, 3L, 1L, 3L, 3L, 3L), .Label = c("<null>", "G  ", "VG "
    ), class = "factor"), COURSEOFFERING_ID = c(97113L, 97113L, 
    97113L, 97113L, 97113L, 97113L, 97113L), ADMISSIONROUND_ID = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "HT2018    ", class = "factor"), 
    RESULT = c(20L, 20L, 20L, 20L, 20L, 20L, 20L), WITHIN_PROGRAM = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L), SUMMA = structure(c(1L, 1L, 1L, 
    1L, 2L, 2L, 2L), .Label = c("67.5", "70.5"), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

Upvotes: 0

Views: 58

Answers (1)

JDG
JDG

Reputation: 1364

Pending clarification on the aggregation of values HB/B5 (comment), you can use data.table's dcast() to convert the data to wide format across MARK levels of UPSEC_COURSE_ID.

dt = data.table(dt)
dt_betyg = dcast(dt, PNR + ZIPCODE + UPSEC_ID + QUEUENUMBER + SEL_CRITERION + QUAL_RATING ~ UPSEC_COURSE_ID, value.var = c('MARK'))

Result

> dt_betyg
          PNR ZIPCODE UPSEC_ID QUEUENUMBER SEL_CRITERION QUAL_RATING <null> EN1201      EN200       MA1201      MA200       SV1201      SV203      
1: 1234567890   46395      HRF           0         HB             55 <null>        <NA>         VG         <NA>         VG         <NA>         G  
2: 2345678901   42332      3SB           0            B5        2645   <NA>         VG         <NA>         VG         <NA>         VG         <NA>

Upvotes: 1

Related Questions