Reputation: 937
I am relatively new to R. I have a data frame, df, that looks like this (a single character variable; my actual df has 100k+ rows, but for simplicity, let's look at only 5 rows):
V1
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy
I want to be able to output every single unique string so that it looks like this:
V1
oximetry
hydrogen peroxide adverse effects
epoprostenol adverse effects
angioedema chemically induced
abo blood group system
imipramine poisoning
adverse effects
isoenzymes
myocardial infarction drug therapy
thrombosis drug therapy
Should I use the tm package? I tried building a document-term matrix (dtm), but my code was inefficient because it converted the dtm to a regular matrix, which requires a lot of memory for 100k+ rows.
Please advise. Thanks!
Upvotes: 1
Views: 5482
Reputation: 7164
Using only base R, you can call strsplit() to split your large string at every comma followed by a space, or at every newline ("\n"). Then use unique() to return only the unique strings:
text_vec <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects
angioedema chemically induced, angioedema chemically induced, oximetry
abo blood group system, imipramine poisoning, adverse effects
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy
thrombosis drug therapy")
strsplit(text_vec, ", |\\n")[[1]]
# [1] "oximetry" "hydrogen peroxide adverse effects"
# [3] "epoprostenol adverse effects" "angioedema chemically induced"
# [5] "angioedema chemically induced" "oximetry"
# [7] "abo blood group system" "imipramine poisoning"
# [9] "adverse effects" "isoenzymes"
# [11] "myocardial infarction drug therapy" "thrombosis drug therapy"
# [13] "thrombosis drug therapy"
unique(strsplit(text_vec, ", |\\n")[[1]])
# [1] "oximetry" "hydrogen peroxide adverse effects"
# [3] "epoprostenol adverse effects" "angioedema chemically induced"
# [5] "abo blood group system" "imipramine poisoning"
# [7] "adverse effects" "isoenzymes"
# [9] "myocardial infarction drug therapy" "thrombosis drug therapy"
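Since the question starts from a data frame rather than one long string, the same idea can be applied directly to the column. The following is a minimal sketch, assuming the column is named V1 as in the question and that terms are always separated by a comma followed by a space; the names pieces, unique_terms and result are illustrative only.

```r
# Stand-in for the asker's data frame: one character column named V1
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"
), stringsAsFactors = FALSE)

# strsplit() is vectorised: it returns a list with one character vector per row.
# fixed = TRUE treats ", " as a literal separator instead of a regular expression.
pieces <- strsplit(df$V1, ", ", fixed = TRUE)

# Flatten the list, trim any stray whitespace, and drop duplicates
unique_terms <- unique(trimws(unlist(pieces)))

# Wrap the result in a one-column data frame, matching the desired output
result <- data.frame(V1 = unique_terms, stringsAsFactors = FALSE)
print(result)
```

This works on character vectors only and never builds a document-term matrix, so it should stay manageable for a column of 100k+ rows.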
Upvotes: 2
Reputation: 1494
Try this:
library(stringr)
library(tidyverse)
df <- data.frame(variable = c(
'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects',
'angioedema chemically induced, angioedema chemically induced, oximetry',
'abo blood group system, imipramine poisoning, adverse effects',
'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy',
'thrombosis drug therapy'), stringsAsFactors=FALSE)
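df %>%
  mutate(variable = str_split(variable, ", ")) %>%  # list column: one character vector per row
  unnest(variable) %>%                              # expand to one row per term
  distinct()                                        # keep each term only once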
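As a possible alternative, tidyr's separate_rows() combines the split and the unnest into one step. This is only a sketch under the same assumptions as above (a column named variable, terms separated by a comma followed by a space); the name result is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(variable = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"
), stringsAsFactors = FALSE)

result <- df %>%
  separate_rows(variable, sep = ", ") %>%  # one row per comma-separated term
  distinct()                               # keep each term only once

print(result)
```

The result is still a data frame with a single column, so it can be used directly wherever the original df was expected.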
Upvotes: 4