pingu87
pingu87

Reputation: 133

R - swap sequences after and before the underscore to have a consistent string in a dataframe

I have a dataframe (R) where I have around 6000 different strings (column S). It might happen that the string before the underscore is swapped and you can find it the string after the underscore. Here an example:

df <- data.frame(val= sample(1:100, 50, replace=T),
            S=c(paste0(rep('A',5),'_',rep('B',5)),
                paste0(rep('C',10),'_',rep('D',10)),
                paste0(rep('B',3),'_', rep('A',3)),
                paste0(rep('C',7),'_', rep('A',7)),
                paste0(rep('E',20),'_',rep('F',20)),
                paste0(rep('F',5),'_', rep('G',5))))

I need to put always the string before and after the underscore in the same order, because they are actually the same (it doesn't matter the order, it is important that they are in the same order). I am loosing my mind on how to do it.

I was trying to split the string S and adding the two string information:

> df$L1 <- unlist(lapply(strsplit(df$S,'_'), function(x) x[1])) 
> df$L2 <-unlist(lapply(strsplit(df$S,'_'), function(x) x[2]))

I was thinking to apply a for loop but I don't know how to check all the different combinations and store the information.

Could you help me?

Upvotes: 1

Views: 39

Answers (2)

TarJae
TarJae

Reputation: 79174

Update: (thanks to @onyambu and @akrun):

library(dplyr)

df %>% 
  separate(S, into=c("L1", "L2"), remove = FALSE) %>% 
  mutate(x = match(L1, LETTERS),
         y = match(L2, LETTERS)) %>% 
  mutate(L1a = ifelse(x < y, L1, L2),
         L2a = ifelse(x < y, L2, L1)) %>% 
  mutate(S_new = paste(L1a, L2a, sep = "_"), .after=2) %>% 
  select(1:3)
val   S S_new
1   25 A_B   A_B
2   73 A_B   A_B
3    4 A_B   A_B
4   28 A_B   A_B
5   95 A_B   A_B
6   45 C_D   C_D
7   64 C_D   C_D
8   86 C_D   C_D
9   85 C_D   C_D
10  53 C_D   C_D
11  74 C_D   C_D
12  15 C_D   C_D
13  23 C_D   C_D
14  23 C_D   C_D
15  23 C_D   C_D
16  86 B_A   A_B
17  64 B_A   A_B
18  68 B_A   A_B
19  69 C_A   A_C
20  39 C_A   A_C
21  55 C_A   A_C
22  47 C_A   A_C
23  50 C_A   A_C
24  23 C_A   A_C
25  90 C_A   A_C
26  10 E_F   E_F
27  19 E_F   E_F
28   1 E_F   E_F
29  37 E_F   E_F
30  39 E_F   E_F
31   9 E_F   E_F
32  78 E_F   E_F
33  20 E_F   E_F
34  90 E_F   E_F
35  93 E_F   E_F
36   9 E_F   E_F
37  64 E_F   E_F
38  84 E_F   E_F
39   3 E_F   E_F
40  46 E_F   E_F
41  21 E_F   E_F
42  88 E_F   E_F
43  30 E_F   E_F
44  95 E_F   E_F
45  63 E_F   E_F
46  81 F_G   F_G
47  81 F_G   F_G
48  62 F_G   F_G
49  35 F_G   F_G
50  92 F_G   F_G

For this we could use separate from tidyr:

library(dplyr)
library(tidyr)
df %>% 
  separate(S, into=c("L1", "L2"), sep = "_", remove = FALSE)
   val   S L1 L2
1   40 A_B  A  B
2   64 A_B  A  B
3   49 A_B  A  B
4   71 A_B  A  B
5   98 A_B  A  B
6    1 C_D  C  D
7   53 C_D  C  D
8   99 C_D  C  D
9   13 C_D  C  D
10  70 C_D  C  D
11  82 C_D  C  D
12  49 C_D  C  D
13  42 C_D  C  D
14  44 C_D  C  D
15  57 C_D  C  D
16  57 B_A  B  A
17  92 B_A  B  A
18  83 B_A  B  A
19  43 C_A  C  A
20  45 C_A  C  A
21  70 C_A  C  A
22  97 C_A  C  A
23  28 C_A  C  A
24  18 C_A  C  A
25  18 C_A  C  A
26  54 E_F  E  F
27  19 E_F  E  F
28  70 E_F  E  F
29  43 E_F  E  F
30  69 E_F  E  F
31  50 E_F  E  F
32  30 E_F  E  F
33  36 E_F  E  F
34  96 E_F  E  F
35  48 E_F  E  F
36  39 E_F  E  F
37  77 E_F  E  F
38  91 E_F  E  F
39  83 E_F  E  F
40  95 E_F  E  F
41  33 E_F  E  F
42  74 E_F  E  F
43  27 E_F  E  F
44   8 E_F  E  F
45  19 E_F  E  F
46  34 F_G  F  G
47  75 F_G  F  G
48  67 F_G  F  G
49  17 F_G  F  G
50   4 F_G  F  G

Upvotes: 2

Andre Wildberg
Andre Wildberg

Reputation: 19163

If the order doesn't matter sort might be helpful here.

cbind(df, S_new = sapply(strsplit(df$S, "_"), function(x) 
  paste(sort(x), collapse="_")))
   val   S S_new
1   44 A_B   A_B
2   84 A_B   A_B
3   78 A_B   A_B
4   95 A_B   A_B
5   87 A_B   A_B
6   70 C_D   C_D
7   34 C_D   C_D
8   55 C_D   C_D
9   94 C_D   C_D
10  94 C_D   C_D
11   7 C_D   C_D
12  14 C_D   C_D
13  60 C_D   C_D
14  58 C_D   C_D
15  37 C_D   C_D
16   8 B_A   A_B
17  31 B_A   A_B
18  64 B_A   A_B
...

Upvotes: 3

Related Questions