Daniel Anderson
Daniel Anderson

Reputation: 2424

tidyr::gather multiple columns of varying types

My question is similar to this question. I'm trying to tidyr::gather multiple columns. However, the solution provided in the link is less than ideal because the attributes are generally not identical across all columns and so they are dropped.

Note, I know how to do this with base R, but I'm trying to learn how to do the equivalent operation with tidyr and/or dplyr.

Below I've simulated some data (poorly, but quickly) that illustrate the situation I often find myself in (although I generally have far more columns that follow this same sort of a pattern). I've provided the base solution with stats::reshape so you can see my desired output.

Any help would be much appreciated.

set.seed(123)
male_g6 <- rbinom(100, 1, .5)
ell_g6 <- rbinom(100, 1, .1)
sped_g6 <- rbinom(100, 1, .15)
pullouts_g6 <- rbinom(100, 5, .1)
disability_g6 <- replicate(100, 
                sample(
                    c("asd", "cd", "ed", "hi", "id", "ohi", "ld", "none"),
                    1,
                    prob = c(rep(0.01, 6), 0.05, 0.89)
                    )
                 )
score_g6 <- rnorm(100, 200, 10)
score_g7 <- score_g6 + 5 + rnorm(100, 0, 2)
score_g8 <- score_g7 + 5 + rnorm(100, 0, 2)

d <- data.frame(
        SID = 1:100,
        male_g6 = male_g6,
        male_g7 = male_g6,
        male_g8 = male_g6,
        ell_g6 = ell_g6,
        ell_g7 = ell_g6,
        ell_g8 = ell_g6,
        sped_g6 = sped_g6,
        sped_g7 = sped_g6,
        sped_g8 = sped_g6,
        pullouts_g6 = pullouts_g6,
        pullouts_g7 = pullouts_g6,
        pullouts_g8 = pullouts_g6,
        disability_g6 = disability_g6,
        disability_g7 = disability_g6,
        disability_g8 = disability_g6,
        score_g6 = score_g6,
        score_g7 = score_g7,
        score_g8 = score_g8
    )

With base reshape

ld <- stats::reshape(d,
        idvar = "SID",
        varying = list(
            c("male_g6", "male_g7", "male_g8"),
            c("ell_g6", "ell_g7", "ell_g8"),
            c("sped_g6", "sped_g7", "sped_g8"),
            c("pullouts_g6", "pullouts_g7", "pullouts_g8"),
            c("disability_g6", "disability_g7", "disability_g8"),
            c("score_g6", "score_g7", "score_g8")
            ),
        v.names = c("male", "ell", "sped", "pullouts", "disability", "score"),
        times = 6:8,
        timevar = "Grade",
        direction = "long"
    )
ld <- ld[order(ld$SID), ]

Upvotes: 4

Views: 489

Answers (1)

alistaire
alistaire

Reputation: 43334

You'll need to gather beyond what you want to end with so you can separate the grade level from the headers, after which you can spread back to wide form:

ld2 <- d %>% gather(var, val, -SID) %>%     # gather to long form
    # separate grade from variable names
    separate(var, c('var', 'grade'), sep = '_g', convert = TRUE) %>% 
    spread(var, val, convert = TRUE)    # spread back to wide

head(ld2)

##   SID grade disability ell male pullouts    score sped
## 1   1     6         cd   0    0        1 196.2440    0
## 2   1     7         cd   0    0        1 203.2739    0
## 3   1     8         cd   0    0        1 211.1347    0
## 4   2     6       none   0    1        0 194.3812    1
## 5   2     7       none   0    1        0 195.3957    1
## 6   2     8       none   0    1        0 202.4890    1

Upvotes: 2

Related Questions