Reputation: 53
I'm trying to identify and aggregate synonyms for a given data set. Please see sample data below.
library(tm)
library(SnowballC)
dataset <- c("dad glad accept large admit large accept dad big large big accept big accept dad dad Happy dad accept glad papa dad Happy dad glad dad dad papa admit Happy big accept accept big accept dad Happy admit Happy Happy glad Happy dad accept accept large daddy large accept large large large big daddy accept admit dad admit daddy dad admit dad admit Happy accept accept Happy daddy accept admit")
docs <- Corpus(VectorSource(dataset))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)
Result:
accept dad happy admit large big daddy glad papa
    15  14     9     8     8   6     4    4    2
I'd like to find synonyms for each of the above words using the wordnet package that I downloaded and installed. For example, to get the synonyms of "accept" I can do:
library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")
filter <- getTermFilter("ExactMatchFilter", "accept", TRUE)
terms <- getIndexTerms("VERB", 1, filter)
getSynonyms(terms[[1]])
Result:
[1] "accept" "admit" "assume" "bear" "consent" "go for" "have" "live with"
[9] "swallow" "take" "take on" "take over"
Now, I'd like to combine these two result sets so that synonyms are grouped in the following way: mark the most common word in each group (rank 1) and group by these words later on, similar to this:
id   word word_count syn_group rank
 1 accept         15         1    1
 5  admit          8         1    2
 2    dad         14         2    1
 8  daddy          4         2    2
 9   papa          2         2    3
 3  happy          9         3    1
 7   glad          4         3    2
 4  large          8         4    1
 6    big          6         4    2
This could then be aggregated like this:
id   word word_count
 1 accept       15+8
 2    dad     14+4+2
 3  happy        9+4
 4  large        8+6
and the final result would then be:
id   word word_count
 1 accept         23
 2    dad         20
 3  large         14
 4  happy         13
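(The aggregation step itself looks manageable in base R, assuming the intermediate table above is already in a data frame called df with the columns shown; it is building the syn_group column that I don't know how to do.)
# Aggregation only: total word_count per syn_group, labelled by its rank-1 word
totals <- aggregate(word_count ~ syn_group, data = df, FUN = sum)
labels <- df[df$rank == 1, c("syn_group", "word")]
merge(labels, totals)[, c("word", "word_count")]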
I have faced several issues, including getting getIndexTerms to loop through the words whether they are nouns, verbs, etc. I hope this all makes sense. Any help would be much appreciated. Thank you.
Upvotes: 3
Views: 5232
Reputation: 2213
I have been able to extract French synonyms automatically from a website as follows:
library(stringr)
library(pagedown)
library(pdftools)
path_Save_PDF <- "D:\\"
base_Url <- "https://dictionary.reverso.net/french-synonyms/"
words <- c("fâché")
nb_Words <- length(words)
list_Text <- list()
# Print each word's page on the Reverso site to a PDF, then extract its text
for(i in 1 : nb_Words)
{
  print(i)
  pdf_File <- paste0(path_Save_PDF, words[i], ".pdf")
  chrome_print(input = paste0(base_Url, words[i]), output = pdf_File)
  list_Text[[i]] <- pdftools::pdf_text(pdf_File)
  list_Text[[i]] <- strsplit(x = list_Text[[i]], split = "\n")
}
save(list_Text, file = "list_Text.RData")
list_Synonymes <- list()
for(i in 1 : nb_Words)
{
  print(i)
  # Keep only the lines that contain a numbered synonym entry
  id_Lines_Synonymes <- which(str_detect(string = list_Text[[i]][[1]], pattern = "[:space:]{4,8}\\d{1,2}"))
  text_Synonymes <- list_Text[[i]][[1]][id_Lines_Synonymes]
  # Strip advertising text, labels and digits left over from the page layout
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "Facebook®(.*)Visit Site")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "and post updates\\.")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "par extension au sens figuré")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "details")
  text_Synonymes <- stringr::str_remove_all(text_Synonymes, pattern = "\\d")
  text_Synonymes <- stringr::str_squish(text_Synonymes)
  # Collapse to one comma-separated string, then split back into individual synonyms
  text_Synonymes <- paste0(text_Synonymes, collapse = ",")
  text_Synonymes <- stringr::str_replace_all(string = text_Synonymes, pattern = "\\,\\,", replacement = "\\,")
  text_Synonymes <- base::strsplit(text_Synonymes, ",")[[1]]
  list_Synonymes[[i]] <- text_Synonymes
}
names(list_Synonymes) <- words
list_Synonymes
Result:
$fâché
[1] "dépité"
[2] " grognon"
[3] " mécontent"
[4] " morfondu"
[5] " transi"
[6] " horripilé"
[7] " irrité"
[8] " contrarié"
[9] " ennuyé"
[10] " frissonnant"
[11] "navré"
[12] " désolé"
[13] "vexé"
[14] " indisposé"
[15] " piqué"
[16] "en colère"
[17] " mécontent"
[18] "désolé"
[19] " navré"
[20] "brouillé avec quelqu'un"
[21] " en froid"
[22] " être incompétent dans un domaine particulier"
[23] " ne rien comprendre"
Afterwards, this list of synonyms could be used to group the words together.
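As a rough sketch (the word counts below are made up purely for illustration, and trimws() is needed because the scraped synonyms keep a leading space), each counted word could be assigned to the first head word whose synonym list contains it:
# Toy word counts, for illustration only
word_counts <- c("fâché" = 10, "irrité" = 4, "content" = 3)

# Assign each word to the first head word whose scraped synonym list contains it;
# words with no match keep their own group
group_of <- vapply(names(word_counts), function(w) {
  hit <- which(vapply(list_Synonymes, function(syns) w %in% trimws(syns), logical(1)))
  if(length(hit) > 0) names(list_Synonymes)[hit[1]] else w
}, character(1))

# Aggregate the counts per synonym group
tapply(word_counts, group_of, sum)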
Upvotes: 0
Reputation: 2797
We can do the following using dplyr:
library(dplyr)
df %>%
  group_by(syn_group) %>%
  mutate(sum_word_count = sum(word_count)) %>%
  filter(rank == 1)
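If you also want to reproduce the final table from the question (one row per synonym group, labelled with its rank-1 word), the same pipe can be extended, for example:
df %>%
  group_by(syn_group) %>%
  mutate(sum_word_count = sum(word_count)) %>%
  filter(rank == 1) %>%
  ungroup() %>%
  select(id, word, word_count = sum_word_count) %>%
  arrange(desc(word_count))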
Data:
df <- read.table(text = "id word word_count syn_group rank
1 accept 15 1 1
5 admit 8 1 2
2 dad 14 2 1
8 daddy 4 2 2
9 papa 2 2 3
3 happy 9 3 1
7 glad 4 3 2
4 large 8 4 1
6 big 6 4 2", header = T)
Next time, please post the output of dput.
Edit: Here is some code to help you get started looping over the words and parts of speech, and storing the synonyms. What is left is to determine whether the current term is a synonym of a previous one, in which case you already have its synonyms and can assign it the same syn_group; to store the results for each word; and to calculate the rank, which is just seq_along the synonyms plus a grep to determine the rank position. The comments are hints at where you might want to include code for these steps; a fuller sketch follows the snippet.
# Keep terms as plain character strings so they can be passed to getTermFilter()
d <- data.frame(Term = row.names(m), word_count = m[, 1], stringsAsFactors = FALSE)
all_pos <- c("ADJECTIVE", "ADVERB", "NOUN", "VERB")
syns <- vector("list", length(all_pos))
for(w in seq(nrow(d))){
  # if syns of d$Term[w] has already been calculated, skip over the current w
  emf <- getTermFilter("ExactMatchFilter", d$Term[w], TRUE)
  for(i in seq_along(syns)){
    terms <- getIndexTerms(all_pos[i], 1, emf)
    if(is.null(terms)){
      syns[i] <- NA
    } else{
      syns[[i]] <- getSynonyms(terms[[1]])
    }
  }
  # store the results of syns for the current w
}
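For completeness, here is one way those missing pieces could look. This is only a sketch under a few assumptions: WordNet has already been initialised with setDict(), m is the matrix from the question, and get_all_synonyms() is a helper name invented here; getIndexTerms() is also wrapped defensively in case it errors or returns an empty result for a word.
library(wordnet)

all_pos <- c("ADJECTIVE", "ADVERB", "NOUN", "VERB")

# Hypothetical helper: all WordNet synonyms of a word, across all parts of speech
get_all_synonyms <- function(word) {
  emf <- getTermFilter("ExactMatchFilter", word, TRUE)
  out <- character(0)
  for(p in all_pos) {
    terms <- tryCatch(getIndexTerms(p, 1, emf), error = function(e) NULL)
    if(!is.null(terms) && length(terms) > 0) {
      out <- c(out, getSynonyms(terms[[1]]))
    }
  }
  unique(tolower(out))
}

d <- data.frame(word = tolower(rownames(m)),
                word_count = rowSums(m),
                stringsAsFactors = FALSE)
d <- d[order(-d$word_count), ]   # most frequent first, so the group head gets rank 1

d$syn_group <- NA_integer_
group_syns <- list()             # synonyms collected so far, one element per group

for(w in seq_len(nrow(d))) {
  hit <- which(vapply(group_syns, function(s) d$word[w] %in% s, logical(1)))
  if(length(hit) > 0) {
    d$syn_group[w] <- hit[1]     # word is a synonym of an earlier, more frequent word
  } else {
    g <- length(group_syns) + 1  # start a new group headed by this word
    group_syns[[g]] <- get_all_synonyms(d$word[w])
    d$syn_group[w] <- g
  }
}

# Rank within each group by descending word_count
d$rank <- ave(-d$word_count, d$syn_group, FUN = function(x) rank(x, ties.method = "first"))

# d now has word, word_count, syn_group and rank; the aggregation can then be
# done as in the dplyr code above (or with aggregate() in base R)
d[order(d$syn_group, d$rank), ]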
Upvotes: 1