Reputation: 123
I am a researcher and have about 17,000 free-text documents, of which around 30-40% are associated with my outcome. Is there an open-source tool I can use to determine the most common words (or even phrases, but not necessary) that are associated with the outcome, normalizing for the frequency of words that are already occurring? All of the documents are written by health care workers, so it will be important to normalize, since there will be technical language across both groups of documents, and I would also want to screen out words like "the", "it", etc.
What I am trying to do is build a tool using regular expressions or NLP that will then use these words to identify the outcome based on new documents. I'm not planning on spending a huge amount of time customizing an NLP tool, so something with reasonable accuracy is good enough.
I know SAS, SQL (am using postgreSQL), and Python, but could potentially get by in R. I haven't done any NLP before. Is there any software I could use that doesn't have too steep of a learning curve? Thanks!
Upvotes: 4
Views: 2604
Reputation: 14701
tool I can use to determine the most common words...
... so something with reasonable accuracy is good enough.
I suggest trying Unix text tools first. This approach comes from the Word Tokenization lesson of the Coursera Natural Language Processing course (there is a YouTube recording of the lesson and a simple tutorial).
We use tr, sort and uniq for this purpose. If you have used Unix text tools before, this is the full command:
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Otherwise below is explanation of every part.
tr -sc 'A-Za-z' '\n' < filename.txt
This command replaces every run of non-letter characters in filename.txt with a newline, so each word ends up on its own line.
cat *.txt | tr -sc 'A-Za-z' '\n'
Same as above, but over all the .txt files in your directory (cat is used here because the shell cannot redirect input from more than one file).
cat *.txt | tr -sc 'A-Za-z' '\n' | sort
Pipe the output to sort. The sorted list starts with many repetitions of the word "a".
cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
Pipe the sorted output to uniq -c to count how many times each distinct word occurs.
cat *.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Pipe the counts through sort again (numeric, reversed) so the most common words appear first.
Problem here: 'and' and 'And' are counted as two separate words.
cat *.txt | tr 'A-Z' 'a-z' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
or
cat *.txt | tr '[:upper:]' '[:lower:]' | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
Convert everything to lowercase first and then run the same pipeline. This gives you the most common words across your files.
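Since you mentioned you know Python, here is a rough pure-Python equivalent of the same counting pipeline. The documents/*.txt glob pattern and the small stop-word set are placeholders for illustration, not part of the original command:

import glob
import re
from collections import Counter

# Illustrative stop-word set; extend it or swap in a library list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "it", "is", "for"}

counts = Counter()
for path in glob.glob("documents/*.txt"):  # placeholder file pattern
    with open(path, encoding="utf-8") as f:
        # Lowercase and keep runs of letters, like the tr pipeline above.
        words = re.findall(r"[a-z]+", f.read().lower())
        counts.update(w for w in words if w not in STOP_WORDS)

# Equivalent of sort | uniq -c | sort -n -r: most common words first.
for word, count in counts.most_common(50):
    print(count, word)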
Upvotes: 2
Reputation: 5940
NLP is certainly not easy and perhaps not really required in this particular case. With regards to normalisation, perhaps tf-idf would be sufficient?
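If you go the Python route, a minimal tf-idf sketch with scikit-learn's TfidfVectorizer might look like the following; the texts and labels variables are placeholders standing in for your documents and outcome flags, and comparing mean weights between the two groups is just one simple way to surface outcome-associated terms:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholders: texts is a list of document strings, labels a parallel 0/1 outcome flag.
texts = ["chest pain and shortness of breath", "routine follow up visit"]
labels = [1, 0]

# English stop words ("the", "it", ...) are dropped; ngram_range also captures two-word phrases.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(texts)  # documents x terms, tf-idf weighted
y = np.array(labels)

# Compare the mean tf-idf weight of each term in outcome vs. non-outcome documents.
mean_pos = np.asarray(X[y == 1].mean(axis=0)).ravel()
mean_neg = np.asarray(X[y == 0].mean(axis=0)).ravel()
diff = mean_pos - mean_neg

terms = np.array(vectorizer.get_feature_names_out())
for i in np.argsort(diff)[::-1][:25]:
    print(f"{terms[i]}\t{diff[i]:.3f}")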
Upvotes: 0
Reputation: 3443
You can find links to some useful R packages here:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Upvotes: -1
Reputation: 1764
GATE (General Architecture for Text Engineering) is a helpful tool here.
Creating annotations and composing phrases from those annotations over your corpus via the GUI tool, and then running the Java Annotation Patterns Engine (JAPE), is very helpful for this.
http://gate.ac.uk/sale/tao/splitch8.html#chap:jape
and
http://gate.ac.uk/sale/thakker-jape-tutorial/GATE%20JAPE%20manual.pdf
are helpful links which you can view.
We built a signs-and-symptoms extraction system over a medical corpus with the help of this tool in one of our applications.
Thanks.
Upvotes: 0