Reputation: 1932
First up - I don´t think this is a duplicate post. I have found several great posts and webpages on transforming multiple columns from wide to long format, but none of them are similar to my data since they deal with several columns that are measures of the same thing (in my example that would be A1, A2, A3, A4), and also they do not already contain a long format variable (frame in my case).
Here is my question:
I am working with a dataset containing many variabels measured by two different motion capture systems. Currently my dataset is in wide format, but I´ve realized ggplot works much better with long format, and I therefore wish to transform my data.
Here is a very simplified version of my data:
id <- (rep(1:3, each = 3))
frame <- (rep(1:3, 3))
A1 <- runif(9, min =1, max =100)
B1 <- runif(9, min =1, max =10)
C1 <- runif(9, min =-10, max =10)
A2 <- rnorm(9, mean = A1, sd=1)
B2 <- rnorm(9, mean = B1, sd=1)
C2 <- rnorm(9, mean = C1, sd=1)
df_wide <- as.data.frame.matrix(cbind(id, frame, A1, B1, C1, A2, B2, C2))
rm(id, frame, A1, A2, B1, B2, C1, C2)
df_wide$id <- as.factor(df_wide$id)
df_wide$frame <- as.factor(df_wide$frame)
head(df_wide)
id frame A1 B1 C1 A2 B2 C2
1 1 1 50.940395 4.141713 -1.294736 51.324398 4.271260 0.6174782
2 1 2 33.117691 5.044080 1.820367 32.977860 5.506677 0.8811504
3 1 3 50.000625 8.584148 -1.294245 50.603195 8.099262 0.6418580
4 2 1 61.675927 5.269216 -6.002856 61.996378 6.186417 -6.5428624
5 2 2 5.514353 6.570010 5.199728 4.798275 4.955662 5.1502535
6 2 3 51.580086 5.683788 9.831663 50.717459 5.430070 10.9601541
A1 and A2, B1 and B2, are measures of the same type of movements (A and B) made by system 1 and system 2. Each patient has been measured several times as indicated by the frame variable.
I would like my data.frame to look like this:
id frame system A B C
1 1 1 1
2 1 1 2
3 1 2 1
4 1 2 2
5 1 3 1
6 1 3 2
I have two problems keeping me from solving this:
1) measurements between the two systems are not placed next to each other. Therefore I cant use code like this:
library(tidyr)
df_long <- gather(df_wide, System, A, A1:A2, factor_key=TRUE)
2) My dataset contains close to 120 variabels, so I would like a solution that wouldn't require me to write the code for each variable. I am considering making a loop to solve this, but any help in this regard will be much appreciated as well.
Upvotes: 1
Views: 76
Reputation: 270368
Run reshape
and then sort the result.
The first 4 lines set up the arguments to reshape
. In particular, varying
is list(A = c("A1", "A2"), B = c("B1", "B2"), C = c("C1", "C2"))
. The last line of code sorts the rows and that could be omitted if the row order is unimportant.
Here the A... columns have the same type as the B... columns and the C... columns but this solution would continue to work even if that were not the case.
No packages are used.
This SO question/answer is similar but there are small differences: Gather multiple date/value columns using tidyr
idvar <- 1:2
nms <- names(df_wide)[-idvar] # names of non-id variables
varying <- split(nms, sub("\\d+$", "", nms))
v.names <- names(varying)
r <- reshape(df_wide, dir = "long", varying = varying, v.names = v.names, idvar = idvar)
r[order(r$id, r$frame), ]
giving:
id frame time A B C
1.1.1 1 1 1 50.940395 4.141713 -1.2947360
1.1.2 1 1 2 51.324398 4.271260 0.6174782
1.2.1 1 2 1 33.117691 5.044080 1.8203670
1.2.2 1 2 2 32.977860 5.506677 0.8811504
1.3.1 1 3 1 50.000625 8.584148 -1.2942450
1.3.2 1 3 2 50.603195 8.099262 0.6418580
2.1.1 2 1 1 61.675927 5.269216 -6.0028560
2.1.2 2 1 2 61.996378 6.186417 -6.5428624
2.2.1 2 2 1 5.514353 6.570010 5.1997280
2.2.2 2 2 2 4.798275 4.955662 5.1502535
2.3.1 2 3 1 51.580086 5.683788 9.8316630
2.3.2 2 3 2 50.717459 5.430070 10.9601541
Note: The input in reproducible form is the following -- the code generating the input in the question is not reproducible because random numbers were used without set.seed
.
df_wide <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), frame = c(1L,
2L, 3L, 1L, 2L, 3L), A1 = c(50.940395, 33.117691, 50.000625,
61.675927, 5.514353, 51.580086), B1 = c(4.141713, 5.04408, 8.584148,
5.269216, 6.57001, 5.683788), C1 = c(-1.294736, 1.820367, -1.294245,
-6.002856, 5.199728, 9.831663), A2 = c(51.324398, 32.97786, 50.603195,
61.996378, 4.798275, 50.717459), B2 = c(4.27126, 5.506677, 8.099262,
6.186417, 4.955662, 5.43007), C2 = c(0.6174782, 0.8811504, 0.641858,
-6.5428624, 5.1502535, 10.9601541)), .Names = c("id", "frame",
"A1", "B1", "C1", "A2", "B2", "C2"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Upvotes: 2
Reputation: 215137
The tidyr
approach would be: 1) gather the measures columns, 2) separate the headers into movements
(alpha) + system
(numeric) using extract
with regex, 3) spread movements
to headers:
library(tidyr)
df_wide %>%
gather(keys, values, -id, -frame) %>%
extract(keys, c("movements", "system"), "([a-zA-Z]+)([0-9]+)") %>%
spread(movements, values)
# id frame system A B C
#1 1 1 1 62.175823 9.661748 -9.120404
#2 1 1 2 62.957358 9.229938 -8.814429
#3 1 2 1 22.463641 3.904546 4.059267
#4 1 2 2 22.798492 3.045190 4.663611
#5 1 3 1 13.897632 6.675986 -9.528184
#6 1 3 2 15.036539 6.964412 -8.920507
#7 2 1 1 38.765030 7.735174 8.373283
#8 2 1 2 40.124285 4.947368 10.143035
#9 2 2 1 5.924254 9.358200 9.866305
#10 2 2 2 5.197255 9.859347 10.088928
#11 2 3 1 29.961107 7.451472 -3.143658
#12 2 3 2 31.322740 8.328626 -2.050261
#13 3 1 1 71.010782 6.909414 7.128306
#14 3 1 2 69.860047 7.675693 7.817473
#15 3 2 1 64.985282 1.596932 -3.422237
#16 3 2 2 64.839996 2.828168 -3.826748
#17 3 3 1 70.631159 1.238806 5.398818
#18 3 3 2 70.963814 1.255340 3.728302
Upvotes: 2