Ezra Polson

Reputation: 235

Word count for text in column

I have a dataset with a column containing text, as follows:

    Column1
    ----------------------------------------------------------
    dapagliflozin 10 MG / metFORMIN hydrochloride 
    dapagliflozin 5 MG / metFORMIN hydrochloride  
    Fortamet       
    Glucophage      
    Glumetza      
    metFORMIN hydrochloride      
    metFORMIN hydrochloride  / pioglitazone 15 MG     
    metFORMIN hydrochloride  / pioglitazone 30 MG      

I am trying to obtain the count of every unique word, for example the count for metFORMIN, the count for hydrochloride, and so on. I tried the table function, but it treats each whole row as a single value, which does not help.

Upvotes: 1

Views: 1610

Answers (2)

Ken Benoit

Reputation: 14902

Or use a text analysis package designed for this:

> require(quanteda)
> dfm(myColumn)
Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing 1 document
   ... shaping tokens into data.table, found 21 total tokens
   ... summing tokens by document
   ... indexing 8 feature types
   ... building sparse matrix
   ... created a 1 x 8 sparse dfm
   ... complete. Elapsed time: 0.047 seconds.
Document-feature matrix of: 1 document, 8 features.
1 x 8 sparse Matrix of class "dfmSparse"
       features
docs    dapagliflozin fortamet glucophage glumetza hydrochloride metformin mg pioglitazone
  text1             2        1          1        1             5         5  4            2

Upvotes: 1

akrun

Reputation: 887501

We can combine strsplit, unlist, and table. Split each string in the column with strsplit, using one or more whitespace characters (\\s+) as the separator; the result is a list. Use unlist to flatten that list into a vector, then call table to get the count of each word.

    table(unlist(strsplit(yourdf$Column1, '\\s+')))
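
For a version that can be run as is, here is a minimal sketch using the sample rows from the question. It rests on a few assumptions that the question does not state: that Column1 is a character column (not a factor), that the "/" separators and bare numbers should not be counted as words, and that counting should ignore case. Drop the grepl or tolower step if either assumption does not hold.

```r
# Sample data copied from the question; `yourdf` and `Column1` follow the
# naming used in this answer.
yourdf <- data.frame(
  Column1 = c(
    "dapagliflozin 10 MG / metFORMIN hydrochloride ",
    "dapagliflozin 5 MG / metFORMIN hydrochloride  ",
    "Fortamet       ",
    "Glucophage      ",
    "Glumetza      ",
    "metFORMIN hydrochloride      ",
    "metFORMIN hydrochloride  / pioglitazone 15 MG     ",
    "metFORMIN hydrochloride  / pioglitazone 30 MG      "
  ),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace, then split each row on runs of whitespace.
# as.character() guards against Column1 being stored as a factor.
words <- unlist(strsplit(trimws(as.character(yourdf$Column1)), "\\s+"))

# Assumption: keep only tokens containing at least one letter, which drops
# the "/" separators and the dose numbers.
words <- words[grepl("[[:alpha:]]", words)]

# Assumption: count case-insensitively, so "metFORMIN" becomes "metformin".
word_counts <- table(tolower(words))

# Show the most frequent words first.
print(sort(word_counts, decreasing = TRUE))
```

With these choices the result has 8 distinct words and 21 tokens in total, which agrees with the quanteda output in the other answer.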

Upvotes: 2
