pbu

Reputation: 3060

Removing stopwords from a text corpus using the Linux command line

I have a roughly 200MB text file (rawtext.txt) and a list of stop words in another text file (stopwords.txt):

I
a
about
an
are
as
at
be
by
com
for

...

I want to remove the stopwords from the text corpus. But how? What is the fastest and easiest way? I would prefer a command-line tool like sed or tr; I don't want to use Python or NLTK.

Can somebody help? I am using Mac OS X (not Linux).

Upvotes: 1

Views: 1867

Answers (2)

Roy Shilkrot

Reputation: 3558

A working solution (also on Mac OS):

cat rawtext.txt | grep -o -E '[a-zA-Z]{3,}' | tr '[:upper:]' '[:lower:]' | sort | uniq | grep -vwFf stopwords.txt

This extracts just the words of three or more letters (ignoring digits and punctuation), converts them to lowercase, sorts and de-duplicates them, and then filters out the stop words (grep -v inverts the match, -w matches whole words only, -F treats the patterns as fixed strings, and -f reads them from stopwords.txt).

Make sure stopwords.txt was treated in the same way (e.g. lowercase).
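If the stop-word list might contain mixed case or duplicates, a one-off normalization along the same lines could look like this (writing to a new file, stopwords_lower.txt, a name chosen here purely for illustration):

tr '[:upper:]' '[:lower:]' < stopwords.txt | sort -u > stopwords_lower.txt

Then point the final grep at stopwords_lower.txt instead of stopwords.txt.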

Upvotes: 1

alexis

Reputation: 50220

Convert your input to word-per-line format, and you can filter it with grep:

tr -s '[:blank:]' '\n' < rawtext.txt | fgrep -vwf stopwords.txt 

This way you don't have to build an arbitrarily large regexp, which could be a problem if your stopwords list is large.
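As a quick sanity check, here is a hypothetical one-line input using only words that appear in the stop-word excerpt above:

echo 'a note for you at noon' > sample.txt
tr -s '[:blank:]' '\n' < sample.txt | fgrep -vwf stopwords.txt

With a, for and at present in stopwords.txt, this should print note, you and noon, one word per line.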

Upvotes: 1
