Reputation: 1335
I want to replace multiple letters/words with a single letter/word, multiple times in a dataframe. As an example,
Some data:
df = data.frame(
a = 1:8,
b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
"soundi i", "soundr I", "sounde in", "soundw In"))
df
a b
1 1 colour1 o
2 2 colour2 O
3 3 colour3 out
4 4 colour4 Out
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Here is what I want to replace with:
df_repl <- list(
O = c("o", "out", "Out"),
In = c("i", "in", "I"))
So in df$b, o, out and Out should become O, and i, in and I should become In, but only if they are separated from any other words by a space, so the o in colour is not capitalised.
This gets me half way there, but I think I need another nested for-loop to move through df_repl ...
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Above, o, out and Out become O, but i, in and I are not altered. Here is the desired output:
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
In the real data there are many more than two replacement words/letters, so I can't just rerun the for-loop again. I'm not tied to a for-loop solution, though I would prefer base R. Any suggestions are much appreciated.
EDIT
Trying to clarify my question:
Whenever one of o, out or Out occurs in df$b, I want to replace it with O.
Whenever one of i, in or I occurs in df$b, I want to replace it with In.
I can achieve the desired output like this:
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
for (word in df_repl[[2]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[2]))
df$b <- gsub(patt, repl, df$b)
}
But in my real dataset df_repl is of length 50 rather than two, so I don't want to copy/paste/edit/rerun the for-loop 50 times.
Upvotes: 2
Views: 469
Reputation: 39737
You can skip the loop over the words in df_repl when you paste them with | (or) between the words like:
for(i in names(df_repl)) {
df$b <- sub(paste(paste0("\\b",df_repl[[i]],"\\b"), collapse = "|")
, i, df$b)
}
df
# a b
#1 1 colour1 O
#2 2 colour2 O
#3 3 colour3 O
#4 4 colour4 O
#5 5 soundi In
#6 6 soundr In
#7 7 sounde In
#8 8 soundw In
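Note that sub replaces only the first match in each string, while gsub replaces every match. If a value in df$b could contain more than one of the words, the same idea can be sketched with gsub and wrapped in a function. This is only a sketch: replace_words is a hypothetical helper name (not a base R function), and the vector b below is made-up test input that includes one string with two matches.

```r
# Hypothetical helper: replace every whole-word match of the values in `repl`
# with the name of the list element they belong to.
replace_words <- function(x, repl) {
  for (target in names(repl)) {
    # One alternation pattern per target, e.g. "\\b(o|out|Out)\\b"
    patt <- paste0("\\b(", paste(repl[[target]], collapse = "|"), ")\\b")
    x <- gsub(patt, target, x)
  }
  x
}

df_repl <- list(
  O  = c("o", "out", "Out"),
  In = c("i", "in", "I"))

# Made-up input; the last string contains two words to replace
b <- c("colour1 o", "colour3 out", "soundi i", "soundw In", "o and in")
result <- replace_words(b, df_repl)
print(result)
```

Because the replacements are applied one list element after another, this assumes that no replacement word (here O or In) also appears among the words to be replaced in a later element.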
Upvotes: 1
Reputation: 6769
This is another solution:
library(stringr)
in1 <- str_split(df$b, " ", simplify = TRUE)[,1]
in2 <- str_split(df$b, " ", simplify = TRUE)[,2]
in2[in2 %in% c("o", "out", "Out")] <- "O"
in2[in2 %in% c("i", "in", "I")] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
If you have a long list of words in your data, you could also move c(word list) outside:
in1<- str_split(df$b, " ", simplify = TRUE)[,1]
in2<- str_split(df$b, " ", simplify = TRUE)[,2]
o <- c("o", "out", "Out")
i <- c("i", "in", "I")
in2[in2 %in% o] <- "O"
in2[in2 %in% i] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
> df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
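With 50 replacement groups, writing one %in% line per group gets long. One possible way to generalise the split-and-match idea in base R is to flatten df_repl into a single named lookup vector. The following is a sketch under the assumption that every value in df$b is exactly two words separated by one space, as in the example data; the names lookup, first, second and hit are chosen here for illustration.

```r
df <- data.frame(
  a = 1:8,
  b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
        "soundi i", "soundr I", "sounde in", "soundw In"),
  stringsAsFactors = FALSE)

df_repl <- list(
  O  = c("o", "out", "Out"),
  In = c("i", "in", "I"))

# Flatten the list into a lookup vector:
# names are the words to find, values are their replacements.
lookup <- setNames(rep(names(df_repl), lengths(df_repl)),
                   unlist(df_repl, use.names = FALSE))

# Assumption: each string is "<first> <second>"
parts  <- strsplit(df$b, " ", fixed = TRUE)
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)

# Replace the second word only where it appears in the lookup
hit <- second %in% names(lookup)
second[hit] <- unname(lookup[second[hit]])

df$b <- paste(first, second)
df
```

Exact matching with %in% means no regular expressions are involved, so the o inside colour is never touched.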
Upvotes: 1
Reputation: 522762
You may try using three separate calls to sub:
df$b <- sub("\\bo\\b", "i", df$b)
df$b <- sub("\\bout\\b", "in", df$b)
df$b <- sub("\\bOut\\b", "I", df$b)
df
a b
1 1 colour1 i
2 2 colour2 O
3 3 colour3 in
4 4 colour4 I
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
To automate this, you could try using sapply with an index:
terms_in <- c("o", "out", "Out")
pat <- paste0("\\b", terms_in, "\\b")
replace <- c("i", "in", "I")
sapply(seq_along(pat), function(x) {
df$b <<- sub(pat[x], replace[x], df$b)
})
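The index-based approach can also be written without the global assignment operator <<-, for example by folding over the indices with Reduce. This is a sketch that reuses the one-to-one pairs from this answer (o to i, out to in, Out to I); b and b_new are illustrative names, and b holds only the first four example strings.

```r
b <- c("colour1 o", "colour2 O", "colour3 out", "colour4 Out")

terms_in <- c("o", "out", "Out")
pat      <- paste0("\\b", terms_in, "\\b")
replace  <- c("i", "in", "I")

# Start from b and apply each substitution in turn,
# passing the result of one step into the next.
b_new <- Reduce(function(x, k) sub(pat[k], replace[k], x),
                seq_along(pat), b)
print(b_new)
```

Reduce returns the final vector, so the result can be assigned back to df$b directly instead of being modified from inside the function.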
Upvotes: 1