kaix

Reputation: 305

How to identify the words that are common between sentences?

I would like to find the words or strings that appear across the 3 columns of my data frame, together with how often each one occurs.

> dput(df1)
structure(list(Jan = "The price of oil declined.", Feb = "The price of gold declined.", 
Mar = "Prices remained unchanged."), row.names = c(NA, -1L
), class = c("tbl_df", "tbl", "data.frame"))

I want to get something like

Word       Count
The        2
price      3
declined   2
of         2

Thank you.

Upvotes: 0

Views: 73

Answers (4)

hello_friend

Reputation: 5788

Base R solution:

setNames(
  data.frame(
    table(
      unlist(strsplit(tolower(do.call(c, df1)), "\\s+|[[:punct:]]"))
    )
  ),
  c("Words", "Frequency")
)

Upvotes: 1

PKumar

Reputation: 11128

Maybe this:

setNames(
  data.frame(
    table(
      unlist(
        strsplit(trimws(tolower(stack(df1)$values), whitespace = '\\.'), '\\s+', perl = TRUE)
      )
    )
  ),
  c('words', 'Frequency')
)

stack(df1) reshapes the data frame from one row of columns into long format, so its values column holds all the sentences. tolower converts them to lower case, and trimws with whitespace = '\\.' strips the trailing periods. strsplit then splits each sentence on whitespace, and unlist flattens the result into a single character vector. table counts each word, data.frame converts the counts to a data frame, and setNames renames the columns.

Output:

#      words Frequency
#1  declined         2
#2      gold         1
#3        of         2
#4       oil         1
#5     price         2
#6    prices         1
#7  remained         1
#8       the         2
#9 unchanged         1
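
The same steps can also be run one at a time, which makes each intermediate result easy to inspect. This is only a sketch: the data frame is rebuilt here as a plain data.frame rather than a tibble, and the names long, sentences, words and counts are illustrative. The whitespace argument of trimws needs R 3.6.0 or later.

```r
# Recreate the question's data (assumption: plain data.frame instead of a tibble)
df1 <- data.frame(
  Jan = "The price of oil declined.",
  Feb = "The price of gold declined.",
  Mar = "Prices remained unchanged.",
  stringsAsFactors = FALSE
)

long      <- stack(df1)                              # long format: columns values and ind
sentences <- tolower(long$values)                    # lower-case every sentence
sentences <- trimws(sentences, whitespace = "\\.")   # strip leading/trailing periods
words     <- unlist(strsplit(sentences, "\\s+"))     # split on whitespace, then flatten
counts    <- setNames(data.frame(table(words)), c("words", "Frequency"))
counts
```

Printing long, sentences or words after each line shows what that step contributes.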

Upvotes: 2

user2974951

Reputation: 10375

This code may not process the data exactly as you wish: for example, it does not treat "price" and "Prices" as the same word. Handling that would make it more complicated.

> data.frame(table(strsplit(tolower(gsub("\\.|\\,","",paste(as.character(unlist(df1)),collapse=" ")))," ")))
       Var1 Freq
1  declined    2
2      gold    1
3        of    2
4       oil    1
5     price    2
6    prices    1
7  remained    1
8       the    2
9 unchanged    1
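
If "price" and "Prices" should count as one word, a very rough way is to strip a trailing "s" before counting. The sketch below is an assumption-laden illustration, not real stemming: the length threshold and the names sentences, words, normalised and counts are made up for this example, and the rule will mangle words such as "glass". A dedicated stemmer (for instance the SnowballC package) would be more robust.

```r
sentences <- c("The price of oil declined.",
               "The price of gold declined.",
               "Prices remained unchanged.")

# Lower-case, drop punctuation, split on whitespace
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(sentences)), "\\s+"))

# Crude normalisation (assumption): drop one trailing "s" from words longer
# than three letters, so "prices" becomes "price"
normalised <- ifelse(nchar(words) > 3, sub("s$", "", words), words)

counts <- as.data.frame(table(normalised), stringsAsFactors = FALSE)
counts[counts$Freq > 1, ]   # keep only words that occur more than once
```

On the sample data this merges "prices" into "price", giving it a count of 3, and the final filter leaves only the words shared between sentences.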

Upvotes: 1

Ronak Shah

Reputation: 388962

You can count the occurrences of each word in the text and keep only the words that occur more than once.

library(dplyr)
library(tidyr)

df1 %>%
  pivot_longer(cols = everything()) %>%
  separate_rows(value, sep = '\\s+') %>%
  mutate(value = tolower(gsub('[[:punct:]]', '', value))) %>%
  count(value) %>%
  filter(n > 1)

Upvotes: 2
