Reputation: 123
I am a researcher and have about 17,000 free-text documents, of which around 30-40% are associated with my outcome. Is there an open-source tool I can use to determine the most common words (or even phrases, but not necessary) that are associated with the outcome, normalizing for the frequency of words that are already occurring? All of the documents are written by health care workers, so it will be important to normalize, since there will be technical language across both groups of documents, and I would also want to screen out words like "the", "it", etc.
What I am trying to do is build a tool using regular expressions or NLP that will then use these words to identify the outcome based on new documents. I'm not planning on spending a huge amount of time customizing an NLP tool, so something with reasonable accuracy is good enough.
I know SAS, SQL (am using postgreSQL), and Python, but could potentially get by in R. I haven't done any NLP before. Is there any software I could use that doesn't have too steep of a learning curve? Thanks!
Upvotes: 4
Views: 2604
Reputation: 14701
tool I can use to determine the most common words...
... so something with reasonable accuracy is good enough.
I suggest trying Unix text tools first. This approach comes from the Word Tokenization lesson of the Coursera Natural Language Processing course (there is a YouTube recording of the lesson and a simple tutorial).
We use tr, sort and uniq for this purpose. If you have used Unix text tools before, this is the full command:
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Otherwise below is explanation of every part.
tr -sc 'A-Za-z' '\n' < filename.txt
This command replaces every run of non-letter characters in filename.txt with a newline, so each word ends up on its own line.
cat *.txt | tr -sc 'A-Za-z' '\n'
Same as above, but over all the .txt files in your directory (cat is used here because the shell cannot redirect input from more than one file).
cat *.txt | tr -sc 'A-Za-z' '\n' | sort
Pipe the output to sort. The sorted list starts with many repetitions of the word "a".
cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
Pipe the sorted output to uniq -c to count how many times each distinct word occurs.
cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Pipe the counts through sort again (numeric, reversed) so the most common words appear first.
Problem here: 'and' and 'And' are counted as two separate words.
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
or
cat *.txt | tr '[:upper:]' '[:lower:]' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Convert everything to lowercase first and then run the same pipeline. This gives you the most common words across your files.
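Since you mentioned you know Python, here is a rough pure-Python equivalent of the same counting pipeline. The documents/*.txt glob pattern and the small stop-word set are placeholders for illustration, not part of the original command:

import glob
import re
from collections import Counter

# Illustrative stop-word set; extend it or swap in a library list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "it", "is", "for"}

counts = Counter()
for path in glob.glob("documents/*.txt"):  # placeholder file pattern
    with open(path, encoding="utf-8") as f:
        # Lowercase and keep runs of letters, like the tr pipeline above.
        words = re.findall(r"[a-z]+", f.read().lower())
        counts.update(w for w in words if w not in STOP_WORDS)

# Equivalent of sort | uniq -c | sort -n -r: most common words first.
for word, count in counts.most_common(50):
    print(count, word)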
Upvotes: 2
Reputation: 5940
NLP is certainly not easy and perhaps not really required in this particular case. With regards to normalisation, perhaps tf-idf would be sufficient?
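If you go the Python route, a minimal tf-idf sketch with scikit-learn's TfidfVectorizer might look like the following; the texts and labels variables are placeholders standing in for your documents and outcome flags, and comparing mean weights between the two groups is just one simple way to surface outcome-associated terms:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholders: texts is a list of document strings, labels a parallel 0/1 outcome flag.
texts = ["chest pain and shortness of breath", "routine follow up visit"]
labels = [1, 0]

# English stop words ("the", "it", ...) are dropped; ngram_range also captures two-word phrases.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(texts)  # documents x terms, tf-idf weighted
y = np.array(labels)

# Compare the mean tf-idf weight of each term in outcome vs. non-outcome documents.
mean_pos = np.asarray(X[y == 1].mean(axis=0)).ravel()
mean_neg = np.asarray(X[y == 0].mean(axis=0)).ravel()
diff = mean_pos - mean_neg

terms = np.array(vectorizer.get_feature_names_out())
for i in np.argsort(diff)[::-1][:25]:
    print(f"{terms[i]}\t{diff[i]:.3f}")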
Upvotes: 0
Reputation: 3443
You can find links to some useful R packages here:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Upvotes: -1
Reputation: 1764
GATE (General Architecture for Text Engineering) is a helpful tool here.
Creating annotations and composing phrases from those annotations over your corpus via the GUI tool, and then running the Java Annotation Patterns Engine (JAPE), is very helpful for this.
http://gate.ac.uk/sale/tao/splitch8.html#chap:jape
and
http://gate.ac.uk/sale/thakker-jape-tutorial/GATE%20JAPE%20manual.pdf
are helpful links which you can view.
We built a signs-and-symptoms extraction system over a medical corpus with the help of this tool in one of our applications.
Thanks.
Upvotes: 0