Reputation: 945
I have a data frame (df) like this (find dput below):
X A_T B_N B_T C_N C_T C_T.1 C_T.2 D_N D_T D_T.1 E_T F_N
w 1 5 9 13 17 21 25 29 33 37 41 45
x 2 6 10 14 18 22 26 30 34 38 42 46
y 3 7 11 15 19 23 27 31 35 39 43 47
z 4 8 12 16 20 24 28 32 36 40 44 48
Were columns (after the first one) have a prefix (here A to F, but could be any other string containing characters, numbers or symbols), then a "_", and then either "T" or "N". Then IDs ending with _T can also have ".1" or ".2" at the end.
What I would like to do is to first identify all possible pairs of IDs of the same prefix, in a table like this (where m1 and m2 simply mean member1 and member2 of each pair, m1 being the "_N" member and m2 the "_T" member, and "pair" is the name of the pair):
m1 m2 pair
B_N B_T B
C_N C_T C
C_N C_T.1 C.1
C_N C_T.2 C.2
D_N D_T D
D_N D_T.1 D.1
Note that IDs not belonging to a pair are discarded (like A, E and F).
Then I would like to compute the difference between the columns (from df) of m2 and m1, for each pair, and put it in a new table where each column is one pair, like this:
X B C C.1 C.2 D D.1
w 4 4 8 12 4 8
x 4 4 8 12 4 8
y 4 4 8 12 4 8
z 4 4 8 12 4 8
Please note that the solution needs to be applicable to bigger dataframes where we cannot indicate the values individually, and to dataframes where the ID prefixes could be any string containing characters, numbers or symbols.
Please let me know if there is anything I can clarify, and thank you for your help!
dput of df:
> dput(df)
structure(c("w", "x", "y", "z", "1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17",
"18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28",
"29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39",
"40", "41", "42", "43", "44", "45", "46", "47", "48"), .Dim = c(4L,
13L), .Dimnames = list(NULL, c("X", "A_T", "B_N", "B_T", "C_N",
"C_T", "C_T.1", "C_T.2", "D_N", "D_T", "D_T.1", "E_T", "F_N")), class = "noquote")
Upvotes: 2
Views: 63
Reputation: 3017
Here is a way you can do this using tidyverse
tools.
tidyr
is used for reshaping the data - first to long, then back to wide.
dplyr
is used for for general data manipulation, including the inner_join
.
stringr
is just used for string manipulation creating the pair
variable.
library(dplyr)
library(tidyr)
library(stringr)
df_orig <- readr::read_table2(
"X A_T B_N B_T C_N C_T C_T.1 C_T.2 D_N D_T D_T.1 E_T F_N
w 1 5 9 13 17 21 25 29 33 37 41 45
x 2 6 10 14 18 22 26 30 34 38 42 46
y 3 7 11 15 19 23 27 31 35 39 43 47
z 4 8 12 16 20 24 28 32 36 40 44 48")
df_long <- df_orig %>%
gather("key", "value", -X) %>%
separate(key, c("prefix", "suffix"), sep = "_")
df_long
#> # A tibble: 48 x 4
#> X prefix suffix value
#> * <chr> <chr> <chr> <int>
#> 1 w A T 1
#> 2 x A T 2
#> 3 y A T 3
#> 4 z A T 4
#> 5 w B N 5
#> 6 x B N 6
#> 7 y B N 7
#> 8 z B N 8
#> 9 w B T 9
#> 10 x B T 10
#> # ... with 38 more rows
N
and T
df_n <- df_long %>%
filter(suffix == "N")
df_t <- df_long %>%
filter(suffix != "N")
N
and T
tables to get pairs, calculate differences, reshape wideHere you should use and inner_join
to drop the cases that only appear in one of the other, like A
, E
, and F
df_merged <- inner_join(df_n, df_t, by = c("X", "prefix"), suffix = c("_m1", "_m2")) %>%
mutate(pair = str_c(prefix, str_sub(suffix_m2, 2)),
value = value_m2 - value_m1) %>%
select(X, pair, value) %>%
spread(pair, value)
df_merged
#> # A tibble: 4 x 7
#> X B C C.1 C.2 D D.1
#> * <chr> <int> <int> <int> <int> <int> <int>
#> 1 w 4 4 8 12 4 8
#> 2 x 4 4 8 12 4 8
#> 3 y 4 4 8 12 4 8
#> 4 z 4 4 8 12 4 8
Upvotes: 1