Reputation: 6874
Let's say I have the following character string:
c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
I want to extract only those words or phrases that are shared between strings, so the result should be: "Date of Procedure", "Patient name", "Type of procedure", "Label". I tried using tidytext
but it forces me to specify the n-gram size, whereas the shared phrases may be one, two or three words long.
Upvotes: 2
Views: 67
Reputation: 23608
When using unnest_tokens
from tidytext with ngrams, you can't remove numbers or other unwanted characters. Switching to the quanteda package helps in this case. See the comments in the code for explanations.
library(quanteda)
# note: in quanteda >= 3.0 textstat_frequency() lives in the quanteda.textstats package
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
# tokenize text and remove punctuation and numbers
toks <- tokens(text, remove_numbers = TRUE, remove_punct = TRUE)
# create 1, 2 and 3 ngrams.
toks_grams <- tokens_ngrams(toks, n = 1:3)
# transform into a document feature matrix (step can be included in next one)
my_dfm <- dfm(toks_grams)
# turn the terms into a frequency table and filter out the ones that have a count of 1.
# Depending on your needs you can filter out single-word ngrams or choose a higher occurring frequency to filter on.
freqs <- textstat_frequency(my_dfm)
freqs[freqs$frequency > 1, ]
feature frequency rank docfreq group
1 of 9 1 9 all
2 procedure 9 1 9 all
3 of_procedure 9 1 9 all
4 patient 6 4 6 all
5 name 6 4 6 all
6 patient_name 6 4 6 all
7 label 6 4 6 all
8 type 5 8 5 all
9 type_of 5 8 5 all
10 type_of_procedure 5 8 5 all
11 date 4 11 4 all
12 date_of 4 11 4 all
13 date_of_procedure 4 11 4 all
14 colonoscopy 3 14 3 all
15 procedure_colonoscopy 3 14 3 all
16 of_procedure_colonoscopy 3 14 3 all
17 ogd 2 17 2 all
18 procedure_ogd 2 17 2 all
19 of_procedure_ogd 2 17 2 all
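To get from this frequency table to the shared header phrases, one option is to keep only the "maximal" ngrams: drop any feature that is contained in a longer feature with the same frequency, then turn the underscores back into spaces. Here is a hedged sketch in base R, using a toy data frame that mimics part of the textstat_frequency() output above (note that overlapping headers can still leave extras such as "of procedure", which you'd remove by inspection or a stricter cutoff):

```r
# toy stand-in for the textstat_frequency() result shown above
freqs <- data.frame(
  feature = c("of", "procedure", "of_procedure", "patient", "name",
              "patient_name", "label", "type", "type_of",
              "type_of_procedure", "date", "date_of", "date_of_procedure"),
  frequency = c(9, 9, 9, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4),
  stringsAsFactors = FALSE
)

# keep features shared across strings (count > 1)
shared <- freqs[freqs$frequency > 1, ]

# a feature is maximal if no other equally frequent feature contains it
is_maximal <- vapply(seq_len(nrow(shared)), function(i) {
  f <- shared$feature[i]
  same <- shared[shared$frequency == shared$frequency[i] &
                   shared$feature != f, "feature"]
  !any(grepl(f, same, fixed = TRUE))
}, logical(1))

# underscores back to spaces for readable phrases
phrases <- gsub("_", " ", shared$feature[is_maximal])
phrases
# "of procedure" "patient name" "label" "type of procedure" "date of procedure"
```

The fixed-string grepl() containment test is deliberately crude (it would also match inside longer words); for real data you may want word-boundary matching instead.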
Upvotes: 1