Sebastian Zeki

Reputation: 6874

How to find phrases that are the same between strings in R

Let's say I have the following character vector:

c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<", 
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<", 
"Patient name Bilbo baggins", "Patient name: Jonny Begood", 
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy", 
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD", 
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy", 
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ", 
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")

I want to extract only the words or phrases that are shared between strings, so the result should be: "Date of Procedure", "Patient name", "Type of procedure", "Label". I tried using tidytext, but it forces me to specify the n-gram size, whereas the shared phrases may be one, two or three words long.

Upvotes: 2

Views: 67

Answers (1)

phiver

Reputation: 23608

When using unnest_tokens from tidytext with ngrams, you can't specify that numbers or other unwanted characters should be removed. Switching to the quanteda package helps in this case. See the comments in the code for explanations.

library(quanteda)
# note: in quanteda >= 3, textstat_frequency() lives in the quanteda.textstats package
library(quanteda.textstats)
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<", 
          ">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<", 
          "Patient name Bilbo baggins", "Patient name: Jonny Begood", 
          "Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy", 
          "Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD", 
          "Type of procedure: Colonoscopy", "Type of procedure Colonoscopy", 
          "Type of procedure: Colonoscopy", "Label 35252", "Label 543 ", 
          "Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")

# tokenize text and remove punctuation and numbers 
toks <- tokens(text, remove_numbers = TRUE, remove_punct = TRUE)

# create 1, 2 and 3 ngrams.
toks_grams <- tokens_ngrams(toks, n = 1:3)

# transform into a document feature matrix (step can be included in next one)    
my_dfm <- dfm(toks_grams)

# turn the features into a frequency table and filter out the ones that occur only once.
# Depending on your needs, you can filter out single words, keep only n-grams, or filter on a higher occurring frequency.
freqs <- textstat_frequency(my_dfm)
freqs[freqs$frequency > 1, ]


                    feature frequency rank docfreq group
1                        of         9    1       9   all
2                 procedure         9    1       9   all
3              of_procedure         9    1       9   all
4                   patient         6    4       6   all
5                      name         6    4       6   all
6              patient_name         6    4       6   all
7                     label         6    4       6   all
8                      type         5    8       5   all
9                   type_of         5    8       5   all
10        type_of_procedure         5    8       5   all
11                     date         4   11       4   all
12                  date_of         4   11       4   all
13        date_of_procedure         4   11       4   all
14              colonoscopy         3   14       3   all
15    procedure_colonoscopy         3   14       3   all
16 of_procedure_colonoscopy         3   14       3   all
17                      ogd         2   17       2   all
18            procedure_ogd         2   17       2   all
19         of_procedure_ogd         2   17       2   all

Upvotes: 1
