arielle
arielle

Reputation: 945

R: Identifying column pairs and computing the differences

I have a data frame (df) like this (find dput below):

X   A_T  B_N  B_T C_N   C_T  C_T.1  C_T.2  D_N  D_T  D_T.1  E_T  F_N
w   1    5    9   13    17   21     25     29   33   37     41   45
x   2    6   10   14    18   22     26     30   34   38     42   46
y   3    7   11   15    19   23     27     31   35   39     43   47
z   4    8   12   16    20   24     28     32   36   40     44   48

Were columns (after the first one) have a prefix (here A to F, but could be any other string containing characters, numbers or symbols), then a "_", and then either "T" or "N". Then IDs ending with _T can also have ".1" or ".2" at the end.

What I would like to do is to first identify all possible pairs of IDs of the same prefix, in a table like this (where m1 and m2 simply mean member1 and member2 of each pair, m1 being the "_N" member and m2 the "_T" member, and "pair" is the name of the pair):

m1    m2     pair
B_N   B_T    B
C_N   C_T    C
C_N   C_T.1  C.1
C_N   C_T.2  C.2
D_N   D_T    D
D_N   D_T.1  D.1

Note that IDs not belonging to a pair are discarded (like A, E and F).

Then I would like to compute the difference between the columns (from df) of m2 and m1, for each pair, and put it in a new table where each column is one pair, like this:

X   B   C   C.1   C.2   D   D.1
w   4   4   8     12    4   8
x   4   4   8     12    4   8
y   4   4   8     12    4   8
z   4   4   8     12    4   8

Please note that the solution needs to be applicable to bigger dataframes where we cannot indicate the values individually, and to dataframes where the ID prefixes could be any string containing characters, numbers or symbols.

Please let me know if there is anything I can clarify, and thank you for your help!

dput of df:

> dput(df)
structure(c("w", "x", "y", "z", "1", "2", "3", "4", "5", "6", 
"7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", 
"18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", 
"29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", 
"40", "41", "42", "43", "44", "45", "46", "47", "48"), .Dim = c(4L, 
13L), .Dimnames = list(NULL, c("X", "A_T", "B_N", "B_T", "C_N", 
"C_T", "C_T.1", "C_T.2", "D_N", "D_T", "D_T.1", "E_T", "F_N")), class = "noquote")

Upvotes: 2

Views: 63

Answers (1)

austensen
austensen

Reputation: 3017

Here is a way you can do this using tidyverse tools.

tidyr is used for reshaping the data - first to long, then back to wide.

dplyr is used for for general data manipulation, including the inner_join.

stringr is just used for string manipulation creating the pair variable.

library(dplyr)
library(tidyr)
library(stringr)

Create Data

df_orig <- readr::read_table2(
"X   A_T  B_N  B_T C_N   C_T  C_T.1  C_T.2  D_N  D_T  D_T.1  E_T  F_N
w   1    5    9   13    17   21     25     29   33   37     41   45
x   2    6   10   14    18   22     26     30   34   38     42   46
y   3    7   11   15    19   23     27     31   35   39     43   47
z   4    8   12   16    20   24     28     32   36   40     44   48")

Reshape data to long

df_long <- df_orig %>% 
  gather("key", "value", -X) %>% 
  separate(key, c("prefix", "suffix"), sep = "_")

df_long

#> # A tibble: 48 x 4
#>        X prefix suffix value
#>  * <chr>  <chr>  <chr> <int>
#>  1     w      A      T     1
#>  2     x      A      T     2
#>  3     y      A      T     3
#>  4     z      A      T     4
#>  5     w      B      N     5
#>  6     x      B      N     6
#>  7     y      B      N     7
#>  8     z      B      N     8
#>  9     w      B      T     9
#> 10     x      B      T    10
#> # ... with 38 more rows

Create separate tables for N and T

df_n <- df_long %>% 
  filter(suffix == "N")

df_t <- df_long %>% 
  filter(suffix != "N")

Merge N and T tables to get pairs, calculate differences, reshape wide

Here you should use and inner_join to drop the cases that only appear in one of the other, like A, E, and F

df_merged <- inner_join(df_n, df_t, by = c("X", "prefix"), suffix = c("_m1", "_m2")) %>% 
  mutate(pair = str_c(prefix, str_sub(suffix_m2, 2)),
         value = value_m2 - value_m1) %>% 
  select(X, pair, value) %>% 
  spread(pair, value)

df_merged

#> # A tibble: 4 x 7
#>       X     B     C   C.1   C.2     D   D.1
#> * <chr> <int> <int> <int> <int> <int> <int>
#> 1     w     4     4     8    12     4     8
#> 2     x     4     4     8    12     4     8
#> 3     y     4     4     8    12     4     8
#> 4     z     4     4     8    12     4     8

Upvotes: 1

Related Questions