Reputation: 51
I have a dataframe containing a series of strings, some of which contain two-word phrases which I want to condense down to a single "pseudo word".
In this example, "united kingdom", "saudi arabia", and "european union" are the phrases of interest. I would like to replace all instances of "united kingdom" with "unitedkingdom", "saudi arabia" with "saudiarabia", and so on.
My dataframe of text strings is as follows:
text.df <- as.data.frame(
c(
"Lorem ipsum dolor sit saudi arabia adipiscing elit.", # Contains "saudi arabia"
"Ut enim ad minim veniam united kingdom exercitation.", # Contains "united kingdom"
"Excepteur sint european union deserunt saudi arabia laborum", # Contains "european union" and "saudi arabia"
"Sed ut perspiciatis unde omnis error sit voluptate." # Contains nothing of interest
))
colnames(text.df) <- 'content'
My lookup dataframe is as follows:
lookup <- data.frame(matrix(ncol = 2, nrow = 3))
lookup$X1 <- c('united kingdom', 'european union', 'saudi arabia')
lookup$X2 <- c('unitedkingdom', 'europeanunion', 'saudiarabia')
My aim is to return a dataframe which looks like:
> new.text.df
content
1 Lorem ipsum dolor sit saudiarabia adipiscing elit.
2 Ut enim ad minim veniam unitedkingdom exercitation.
3 Excepteur sint europeanunion deserunt saudiarabia laborum
4 Sed ut perspiciatis unde omnis error sit voluptate.
>
If anyone is able to help it would be greatly appreciated! Thanks in advance.
Upvotes: 2
Views: 73
Reputation: 341
You could try this:
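library(stringr)
# Apply each lookup pair in turn to the text.
# fixed() makes stringr match the phrase literally instead of as a regex.
transform_word <- function(text) {
  for (i in seq_len(nrow(lookup))) {
    text <- str_replace_all(text, fixed(lookup$X1[i]), lookup$X2[i])
  }
  text
}
# str_replace_all() is vectorised, so the whole column can be passed at once.
# as.character() guards against the column having been stored as a factor.
text.df$content <- transform_word(as.character(text.df$content))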
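The loop in this answer can also be written without stringr. The following is only a minimal base R sketch, assuming the lookup phrases should be matched literally (hence `fixed = TRUE`); `replace_phrases` is a name made up for this illustration, and the data is copied from the question.

```r
# Lookup table and text copied from the question.
lookup <- data.frame(
  X1 = c("united kingdom", "european union", "saudi arabia"),
  X2 = c("unitedkingdom", "europeanunion", "saudiarabia"),
  stringsAsFactors = FALSE
)

text.df <- data.frame(
  content = c(
    "Lorem ipsum dolor sit saudi arabia adipiscing elit.",
    "Ut enim ad minim veniam united kingdom exercitation.",
    "Excepteur sint european union deserunt saudi arabia laborum",
    "Sed ut perspiciatis unde omnis error sit voluptate."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply every (from, to) pair in turn.
# gsub() is vectorised over `text`, so no sapply() is needed.
replace_phrases <- function(text, from, to) {
  for (i in seq_along(from)) {
    text <- gsub(from[i], to[i], text, fixed = TRUE)
  }
  text
}

new.text.df <- data.frame(
  content = replace_phrases(text.df$content, lookup$X1, lookup$X2),
  stringsAsFactors = FALSE
)
print(new.text.df)
```

Note that literal matching also replaces a phrase inside a longer word (for example "united kingdoms"), which may or may not be what you want.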
Upvotes: 1
Reputation: 16121
library(qdap)
text.df <- as.data.frame(
c(
"Lorem ipsum dolor sit saudi arabia adipiscing elit.", # Contains "saudi arabia"
"Ut enim ad minim veniam united kingdom exercitation.", # Contains "united kingdom"
"Excepteur sint european union deserunt saudi arabia laborum", # Contains "european union" and "saudi arabia"
"Sed ut perspiciatis unde omnis error sit voluptate." # Contains nothing of interest
), stringsAsFactors = FALSE)
colnames(text.df) <- 'content'
lookup <- data.frame(matrix(ncol = 2, nrow = 3))
lookup$X1 <- c('united kingdom', 'european union', 'saudi arabia')
lookup$X2 <- c('unitedkingdom', 'europeanunion', 'saudiarabia')
# provide patterns, replacements, actual texts to update
mgsub(lookup$X1, lookup$X2, text.df$content)
# [1] "Lorem ipsum dolor sit saudiarabia adipiscing elit."
# [2] "Ut enim ad minim veniam unitedkingdom exercitation."
# [3] "Excepteur sint europeanunion deserunt saudiarabia laborum"
# [4] "Sed ut perspiciatis unde omnis error sit voluptate."
Upvotes: 3