Reputation: 133
I have a dataframe (R) where I have around 6000 different strings (column S). It might happen that the string before the underscore is swapped and you can find it the string after the underscore. Here an example:
df <- data.frame(val= sample(1:100, 50, replace=T),
S=c(paste0(rep('A',5),'_',rep('B',5)),
paste0(rep('C',10),'_',rep('D',10)),
paste0(rep('B',3),'_', rep('A',3)),
paste0(rep('C',7),'_', rep('A',7)),
paste0(rep('E',20),'_',rep('F',20)),
paste0(rep('F',5),'_', rep('G',5))))
I need to put always the string before and after the underscore in the same order, because they are actually the same (it doesn't matter the order, it is important that they are in the same order). I am loosing my mind on how to do it.
I was trying to split the string S and adding the two string information:
> df$L1 <- unlist(lapply(strsplit(df$S,'_'), function(x) x[1]))
> df$L2 <-unlist(lapply(strsplit(df$S,'_'), function(x) x[2]))
I was thinking to apply a for loop but I don't know how to check all the different combinations and store the information.
Could you help me?
Upvotes: 1
Views: 39
Reputation: 79174
Update: (thanks to @onyambu and @akrun):
library(dplyr)
df %>%
separate(S, into=c("L1", "L2"), remove = FALSE) %>%
mutate(x = match(L1, LETTERS),
y = match(L2, LETTERS)) %>%
mutate(L1a = ifelse(x < y, L1, L2),
L2a = ifelse(x < y, L2, L1)) %>%
mutate(S_new = paste(L1a, L2a, sep = "_"), .after=2) %>%
select(1:3)
val S S_new
1 25 A_B A_B
2 73 A_B A_B
3 4 A_B A_B
4 28 A_B A_B
5 95 A_B A_B
6 45 C_D C_D
7 64 C_D C_D
8 86 C_D C_D
9 85 C_D C_D
10 53 C_D C_D
11 74 C_D C_D
12 15 C_D C_D
13 23 C_D C_D
14 23 C_D C_D
15 23 C_D C_D
16 86 B_A A_B
17 64 B_A A_B
18 68 B_A A_B
19 69 C_A A_C
20 39 C_A A_C
21 55 C_A A_C
22 47 C_A A_C
23 50 C_A A_C
24 23 C_A A_C
25 90 C_A A_C
26 10 E_F E_F
27 19 E_F E_F
28 1 E_F E_F
29 37 E_F E_F
30 39 E_F E_F
31 9 E_F E_F
32 78 E_F E_F
33 20 E_F E_F
34 90 E_F E_F
35 93 E_F E_F
36 9 E_F E_F
37 64 E_F E_F
38 84 E_F E_F
39 3 E_F E_F
40 46 E_F E_F
41 21 E_F E_F
42 88 E_F E_F
43 30 E_F E_F
44 95 E_F E_F
45 63 E_F E_F
46 81 F_G F_G
47 81 F_G F_G
48 62 F_G F_G
49 35 F_G F_G
50 92 F_G F_G
For this we could use separate
from tidyr
:
library(dplyr)
library(tidyr)
df %>%
separate(S, into=c("L1", "L2"), sep = "_", remove = FALSE)
val S L1 L2
1 40 A_B A B
2 64 A_B A B
3 49 A_B A B
4 71 A_B A B
5 98 A_B A B
6 1 C_D C D
7 53 C_D C D
8 99 C_D C D
9 13 C_D C D
10 70 C_D C D
11 82 C_D C D
12 49 C_D C D
13 42 C_D C D
14 44 C_D C D
15 57 C_D C D
16 57 B_A B A
17 92 B_A B A
18 83 B_A B A
19 43 C_A C A
20 45 C_A C A
21 70 C_A C A
22 97 C_A C A
23 28 C_A C A
24 18 C_A C A
25 18 C_A C A
26 54 E_F E F
27 19 E_F E F
28 70 E_F E F
29 43 E_F E F
30 69 E_F E F
31 50 E_F E F
32 30 E_F E F
33 36 E_F E F
34 96 E_F E F
35 48 E_F E F
36 39 E_F E F
37 77 E_F E F
38 91 E_F E F
39 83 E_F E F
40 95 E_F E F
41 33 E_F E F
42 74 E_F E F
43 27 E_F E F
44 8 E_F E F
45 19 E_F E F
46 34 F_G F G
47 75 F_G F G
48 67 F_G F G
49 17 F_G F G
50 4 F_G F G
Upvotes: 2
Reputation: 19163
If the order doesn't matter sort
might be helpful here.
cbind(df, S_new = sapply(strsplit(df$S, "_"), function(x)
paste(sort(x), collapse="_")))
val S S_new
1 44 A_B A_B
2 84 A_B A_B
3 78 A_B A_B
4 95 A_B A_B
5 87 A_B A_B
6 70 C_D C_D
7 34 C_D C_D
8 55 C_D C_D
9 94 C_D C_D
10 94 C_D C_D
11 7 C_D C_D
12 14 C_D C_D
13 60 C_D C_D
14 58 C_D C_D
15 37 C_D C_D
16 8 B_A A_B
17 31 B_A A_B
18 64 B_A A_B
...
Upvotes: 3