sweetmusicality

Reputation: 937

Find all unique strings in R

I am relatively new to R. I have a data frame df with a single character variable. My actual df spans 100k+ rows, but for simplicity, let's look at only 5 rows:

V1
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy

I want to be able to output every single unique string so that it looks like this:

V1
oximetry
hydrogen peroxide adverse effects
epoprostenol adverse effects
angioedema chemically induced
abo blood group system
imipramine poisoning
adverse effects
isoenzymes
myocardial infarction drug therapy
thrombosis drug therapy

Do I use the tm package? I tried building a document-term matrix (dtm), but my code was inefficient: it converted the dtm to a matrix, which requires a lot of memory with 100k+ rows.

Please advise. Thanks!

Upvotes: 1

Views: 5482

Answers (2)

KenHBS

Reputation: 7164

Using only base R, you can use strsplit() to split your large string at every "comma+space" or "\n". Then use unique() to keep only the distinct strings:

text_vec <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy")

strsplit(text_vec, ", |\\n")[[1]]
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced"     
# [5] "angioedema chemically induced"      "oximetry"                          
# [7] "abo blood group system"             "imipramine poisoning"              
# [9] "adverse effects"                    "isoenzymes"                        
# [11] "myocardial infarction drug therapy" "thrombosis drug therapy"           
# [13] "thrombosis drug therapy"   

unique(strsplit(text_vec, ", |\\n")[[1]])
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced"     
# [5] "abo blood group system"             "imipramine poisoning"              
# [7] "adverse effects"                    "isoenzymes"                        
# [9] "myocardial infarction drug therapy" "thrombosis drug therapy" 
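
If the strings live in a data frame column, as in the question, rather than in one long string, the same idea can be applied per row. The following is a minimal sketch, assuming the column is named V1 as in the question; the names terms and result are only illustrative. Because the delimiter is a literal ", ", fixed = TRUE is used to skip regular expression matching, which may help a little on 100k+ rows.

```r
# Example data frame mirroring the five rows from the question
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"), stringsAsFactors = FALSE)

# Split every row on the literal ", ", flatten the resulting list
# into one character vector, and drop duplicates
terms <- unique(unlist(strsplit(df$V1, ", ", fixed = TRUE)))

# Wrap the unique terms back into a one-column data frame
result <- data.frame(V1 = terms, stringsAsFactors = FALSE)
print(result)
```

This never builds a document-term matrix, so memory use should stay roughly proportional to the number of terms.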

Upvotes: 2

Alex P

Reputation: 1494

Try this:

library(stringr)
library(tidyverse)

df <- data.frame(variable = c(
'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects',
'angioedema chemically induced, angioedema chemically induced, oximetry',
'abo blood group system, imipramine poisoning, adverse effects',
'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy',
'thrombosis drug therapy'), stringsAsFactors=FALSE)

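# Split each row into a list of terms, expand the list column to one
# term per row, then keep only the distinct rows.
# Recent tidyr versions expect the column to unnest to be named explicitly.
mutate(df, variable = str_split(variable, ', ')) %>%
  unnest(variable) %>% distinct()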
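
As a possible shortcut, tidyr's separate_rows() does the split and the expansion in one step. This is a sketch under the same assumptions as above (a character column named variable, terms delimited by ", "); the names df and out are only illustrative.

```r
library(dplyr)
library(tidyr)

# Same example data as above
df <- data.frame(variable = c(
  'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects',
  'angioedema chemically induced, angioedema chemically induced, oximetry',
  'abo blood group system, imipramine poisoning, adverse effects',
  'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy',
  'thrombosis drug therapy'), stringsAsFactors = FALSE)

# separate_rows() splits each cell on the delimiter and gives every
# piece its own row; distinct() then removes the repeated terms
out <- df %>%
  separate_rows(variable, sep = ", ") %>%
  distinct()

print(out)
```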

Upvotes: 4
