Reputation: 3060
I have a roughly 200MB text file (rawtext.txt) and a list of stop words in a text file (stopwords.txt):
I
a
about
an
are
as
at
be
by
com
for
...
I want to remove the stop words from the text corpus. But how? What is the fastest and easiest way? I'd prefer a command-line tool like sed or tr; I don't want to use Python or NLTK.
Can somebody help? I am using Mac OS X (not Linux).
Upvotes: 1
Views: 1867
Reputation: 3558
A working solution (also works on Mac OS X):
grep -oE '[a-zA-Z]{3,}' rawtext.txt | tr '[:upper:]' '[:lower:]' | sort -u | grep -vwFf stopwords.txt
This extracts every word of three or more letters (letters only, so numbers are dropped), converts the words to lowercase, sorts them and removes duplicates, then filters out the stop words.
Make sure stopwords.txt has been processed the same way (e.g. lowercased).
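For that last step, a one-off normalization of the stop word list could look something like this (the output name stopwords_lc.txt is just an illustrative choice):
tr '[:upper:]' '[:lower:]' < stopwords.txt | sort -u > stopwords_lc.txt
Then point the final grep at stopwords_lc.txt instead of stopwords.txt.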
Upvotes: 1
Reputation: 50220
Convert your input to word-per-line format, and you can filter it with grep:
tr -s '[:blank:]' '\n' < rawtext.txt | fgrep -vwf stopwords.txt
This way you don't have to build an arbitrarily large regexp, which could be a problem if your stopwords list is large.
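If the corpus has punctuation glued to words (e.g. "the," or "word."), a sketch of a variant that splits on every non-letter character and matches case-insensitively, so no separate lowercasing pass is needed (my own adaptation, not part of the original answer):
tr -cs '[:alpha:]' '\n' < rawtext.txt | grep -viwFf stopwords.txt
Here tr -c complements the set (everything that is not a letter becomes a newline) and -s squeezes runs of newlines into one line break.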
Upvotes: 1