Fast shell command to remove stop words in a text file

Question

I have a 2GB text file and I am trying to remove frequently occurring English stop words from it.

I have a stopwords.txt file that contains one word per line, like this:

a
an
the
for
and
I

What is the fastest way to do this with a shell command such as tr, sed or awk?

Ken Benoit · Accepted Answer

Here's a method using the command line and perl:

Save the text below as replacesw.sh:

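#!/bin/bash
# Usage: ./replacesw.sh STOPWORD_FILE TEXT_FILE
# Join the stop words (one per line) into an alternation such as a|an|the.
WORDS=$(perl -0777 -pe 's/\s+\z//; s/\n/|/g' "$1")
# Delete every stop word that occurs as a whole word.
perl -pe "s/\\b(?:$WORDS)\\b//g" "$2"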
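
As a rough sketch of the pattern the script is meant to hand to perl, the snippet below builds the whole-word alternation from a stop word file and prints it, so you can inspect it before running it over a large file. The function name build_stopword_regex is made up for illustration, and the sketch assumes one stop word per line with no regex metacharacters in the words.

```shell
#!/bin/bash
# build_stopword_regex is a hypothetical helper, not part of the script above.
# It prints a whole-word pattern for a stop word file with one word per line.
build_stopword_regex() {
    # Skip blank lines, join the remaining words with "|",
    # and wrap the result in \b( ... )\b.
    awk 'NF { words = words sep $1; sep = "|" }
         END { printf "\\b(%s)\\b\n", words }' "$1"
}
```

With the stopwords.txt from the question, build_stopword_regex stopwords.txt prints \b(a|an|the|for|and|I)\b.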

Then, if your stop word list is saved as stopwords.txt and you have a second file, e.g. testtext.txt, that contains:

This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.

Then the following at the command line will remove the stopwords:

$ ./replacesw.sh stopwords.txt testtext.txt
This is  file with  stopwords from  stopwords.txt  testing.
More than one line in  file,   better test.

You might need to run chmod u+x replacesw.sh first to make the script executable.
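
Since the question also mentions awk, here is a minimal sketch of an awk variant that keeps the stop words in an array and filters each line field by field. The function name remove_stopwords_awk is hypothetical. It is not equivalent to the perl version: it compares whitespace-separated tokens exactly, so a stop word with punctuation attached (such as "the,") is kept, and runs of spaces collapse to one. It also assumes the stop word file is not empty. Whether it is faster on a 2GB file would need measuring.

```shell
#!/bin/bash
# remove_stopwords_awk is a hypothetical helper name.
# Usage: remove_stopwords_awk STOPWORD_FILE TEXT_FILE
remove_stopwords_awk() {
    awk '
        # First file: remember every stop word as an array key.
        NR == FNR { if (NF) stop[$1]; next }
        # Remaining input: keep only the fields that are not stop words.
        {
            out = ""; sep = ""
            for (i = 1; i <= NF; i++)
                if (!($i in stop)) { out = out sep $i; sep = " " }
            print out
        }
    ' "$1" "$2"
}
```

It is called the same way as the script: remove_stopwords_awk stopwords.txt testtext.txt.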
