Reputation: 235
I have a dataset with a column containing text as follows:
Column1
----------------------------------------------------------
dapagliflozin 10 MG / metFORMIN hydrochloride
dapagliflozin 5 MG / metFORMIN hydrochloride
Fortamet
Glucophage
Glumetza
metFORMIN hydrochloride
metFORMIN hydrochloride / pioglitazone 15 MG
metFORMIN hydrochloride / pioglitazone 30 MG
I am trying to obtain a count for every unique word across the column, for example the count for metFORMIN, the count for hydrochloride, and so on. I tried the table function, but it treats each whole row as a single value, which is not helpful.
Upvotes: 1
Views: 1610
Reputation: 14902
Or use a text analysis package designed for this:
> require(quanteda)
> dfm(myColumn)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing 1 document
... shaping tokens into data.table, found 21 total tokens
... summing tokens by document
... indexing 8 feature types
... building sparse matrix
... created a 1 x 8 sparse dfm
... complete. Elapsed time: 0.047 seconds.
Document-feature matrix of: 1 document, 8 features.
1 x 8 sparse Matrix of class "dfmSparse"
        features
docs    dapagliflozin fortamet glucophage glumetza hydrochloride metformin mg pioglitazone
text1               2        1          1        1             5         5  4            2
Upvotes: 1
Reputation: 887501
We can use a combination of strsplit/unlist/table. Split the column strings with strsplit, specifying the split as one or more whitespace characters (\\s+). The output will be a list. Use unlist to flatten the list into a vector, and then use table to get the count of each unique element.
table(unlist(strsplit(yourdf$Column1, '\\s+')))
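As a minimal sketch of the same approach on the data from the question: the data frame below is rebuilt by hand from the rows shown there, and the name yourdf simply follows the one-liner above. The as.character() call, the lowercasing and the filter that keeps only alphabetic tokens are my own additions (assumptions, not part of the original answer). They guard against a factor column and drop the "/" separators and dose numbers, which would otherwise be counted as words.

```r
# Sample data copied from the question; `yourdf` follows the naming used above.
yourdf <- data.frame(
  Column1 = c(
    "dapagliflozin 10 MG / metFORMIN hydrochloride",
    "dapagliflozin 5 MG / metFORMIN hydrochloride",
    "Fortamet",
    "Glucophage",
    "Glumetza",
    "metFORMIN hydrochloride",
    "metFORMIN hydrochloride / pioglitazone 15 MG",
    "metFORMIN hydrochloride / pioglitazone 30 MG"
  ),
  stringsAsFactors = FALSE
)

# Split each row on runs of whitespace, then flatten the list into one vector.
# as.character() is a precaution in case the column was read in as a factor.
tokens <- unlist(strsplit(as.character(yourdf$Column1), "\\s+"))

# Optional clean-up (an assumption, not part of the original answer):
# lowercase everything and keep only purely alphabetic tokens,
# which removes the "/" separators and the dose numbers.
words <- tolower(tokens)
words <- words[grepl("^[a-z]+$", words)]

# Count occurrences of each unique word, most frequent first.
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```

With this clean-up the result has the same eight words and the same counts as the quanteda output in the other answer. Without it, table() also reports entries for "/", "10", "5", "15" and "30", and treats differently cased spellings as distinct words.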
Upvotes: 2