Steen Harsted
Steen Harsted

Reputation: 1932

Transforming multible columns from wide to long format.


First up - I don´t think this is a duplicate post. I have found several great posts and webpages on transforming multiple columns from wide to long format, but none of them are similar to my data since they deal with several columns that are measures of the same thing (in my example that would be A1, A2, A3, A4), and also they do not already contain a long format variable (frame in my case).


Here is my question:

I am working with a dataset containing many variabels measured by two different motion capture systems. Currently my dataset is in wide format, but I´ve realized ggplot works much better with long format, and I therefore wish to transform my data.

Here is a very simplified version of my data:

id <- (rep(1:3, each = 3))
frame <- (rep(1:3, 3))
A1 <- runif(9, min =1, max =100)
B1 <- runif(9, min =1, max =10)
C1 <- runif(9, min =-10, max =10)
A2 <- rnorm(9, mean = A1, sd=1)
B2 <- rnorm(9, mean = B1, sd=1)
C2 <- rnorm(9, mean = C1, sd=1)
df_wide <- as.data.frame.matrix(cbind(id, frame, A1, B1, C1, A2, B2, C2))
rm(id, frame, A1, A2, B1, B2, C1, C2)

df_wide$id <- as.factor(df_wide$id)
df_wide$frame <- as.factor(df_wide$frame)

head(df_wide)

  id frame        A1       B1        C1        A2       B2         C2
1  1     1 50.940395 4.141713 -1.294736 51.324398 4.271260  0.6174782
2  1     2 33.117691 5.044080  1.820367 32.977860 5.506677  0.8811504
3  1     3 50.000625 8.584148 -1.294245 50.603195 8.099262  0.6418580
4  2     1 61.675927 5.269216 -6.002856 61.996378 6.186417 -6.5428624
5  2     2  5.514353 6.570010  5.199728  4.798275 4.955662  5.1502535
6  2     3 51.580086 5.683788  9.831663 50.717459 5.430070 10.9601541

A1 and A2, B1 and B2, are measures of the same type of movements (A and B) made by system 1 and system 2. Each patient has been measured several times as indicated by the frame variable.

I would like my data.frame to look like this:

  id frame system        A        B         C
1  1     1      1 
2  1     1      2 
3  1     2      1 
4  1     2      2 
5  1     3      1 
6  1     3      2 

I have two problems keeping me from solving this:

1) measurements between the two systems are not placed next to each other. Therefore I cant use code like this:

library(tidyr)
df_long <- gather(df_wide, System, A, A1:A2, factor_key=TRUE)

2) My dataset contains close to 120 variabels, so I would like a solution that wouldn't require me to write the code for each variable. I am considering making a loop to solve this, but any help in this regard will be much appreciated as well.

Upvotes: 1

Views: 76

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270368

Run reshape and then sort the result.

The first 4 lines set up the arguments to reshape. In particular, varying is list(A = c("A1", "A2"), B = c("B1", "B2"), C = c("C1", "C2")) . The last line of code sorts the rows and that could be omitted if the row order is unimportant.

Here the A... columns have the same type as the B... columns and the C... columns but this solution would continue to work even if that were not the case.

No packages are used.

This SO question/answer is similar but there are small differences: Gather multiple date/value columns using tidyr

idvar <- 1:2
nms <- names(df_wide)[-idvar]   # names of non-id variables
varying <- split(nms, sub("\\d+$", "", nms))
v.names <- names(varying)
r <- reshape(df_wide, dir = "long", varying = varying, v.names = v.names, idvar = idvar)
r[order(r$id, r$frame), ]

giving:

      id frame time         A        B          C
1.1.1  1     1    1 50.940395 4.141713 -1.2947360
1.1.2  1     1    2 51.324398 4.271260  0.6174782
1.2.1  1     2    1 33.117691 5.044080  1.8203670
1.2.2  1     2    2 32.977860 5.506677  0.8811504
1.3.1  1     3    1 50.000625 8.584148 -1.2942450
1.3.2  1     3    2 50.603195 8.099262  0.6418580
2.1.1  2     1    1 61.675927 5.269216 -6.0028560
2.1.2  2     1    2 61.996378 6.186417 -6.5428624
2.2.1  2     2    1  5.514353 6.570010  5.1997280
2.2.2  2     2    2  4.798275 4.955662  5.1502535
2.3.1  2     3    1 51.580086 5.683788  9.8316630
2.3.2  2     3    2 50.717459 5.430070 10.9601541

Note: The input in reproducible form is the following -- the code generating the input in the question is not reproducible because random numbers were used without set.seed.

df_wide <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), frame = c(1L, 
2L, 3L, 1L, 2L, 3L), A1 = c(50.940395, 33.117691, 50.000625, 
61.675927, 5.514353, 51.580086), B1 = c(4.141713, 5.04408, 8.584148, 
5.269216, 6.57001, 5.683788), C1 = c(-1.294736, 1.820367, -1.294245, 
-6.002856, 5.199728, 9.831663), A2 = c(51.324398, 32.97786, 50.603195, 
61.996378, 4.798275, 50.717459), B2 = c(4.27126, 5.506677, 8.099262, 
6.186417, 4.955662, 5.43007), C2 = c(0.6174782, 0.8811504, 0.641858, 
-6.5428624, 5.1502535, 10.9601541)), .Names = c("id", "frame", 
"A1", "B1", "C1", "A2", "B2", "C2"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

Upvotes: 2

akuiper
akuiper

Reputation: 215137

The tidyr approach would be: 1) gather the measures columns, 2) separate the headers into movements (alpha) + system(numeric) using extract with regex, 3) spread movements to headers:

library(tidyr)

df_wide %>% 
    gather(keys, values, -id, -frame) %>% 
    extract(keys, c("movements", "system"), "([a-zA-Z]+)([0-9]+)") %>% 
    spread(movements, values)

#   id frame system         A        B         C
#1   1     1      1 62.175823 9.661748 -9.120404
#2   1     1      2 62.957358 9.229938 -8.814429
#3   1     2      1 22.463641 3.904546  4.059267
#4   1     2      2 22.798492 3.045190  4.663611
#5   1     3      1 13.897632 6.675986 -9.528184
#6   1     3      2 15.036539 6.964412 -8.920507
#7   2     1      1 38.765030 7.735174  8.373283
#8   2     1      2 40.124285 4.947368 10.143035
#9   2     2      1  5.924254 9.358200  9.866305
#10  2     2      2  5.197255 9.859347 10.088928
#11  2     3      1 29.961107 7.451472 -3.143658
#12  2     3      2 31.322740 8.328626 -2.050261
#13  3     1      1 71.010782 6.909414  7.128306
#14  3     1      2 69.860047 7.675693  7.817473
#15  3     2      1 64.985282 1.596932 -3.422237
#16  3     2      2 64.839996 2.828168 -3.826748
#17  3     3      1 70.631159 1.238806  5.398818
#18  3     3      2 70.963814 1.255340  3.728302

Upvotes: 2

Related Questions