pbu

Reputation: 3060

Fast shell command to remove stop words in a text file

I have a 2GB text file. I am trying to remove frequently occurring English stop words from this file.

I have a file, stopwords.txt, that lists one stop word per line, like this:

a
an
the
for
and
I

What is the fastest way to do this using a shell command such as tr, sed, or awk?

Upvotes: 3

Views: 1022

Answers (1)

Ken Benoit

Reputation: 14902

Here's a method using the command line and Perl:

Save the text below as replacesw.sh:

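#!/bin/bash
# Join the stop words in $1 into one alternation: \b(a|an|the|...)\b
MYREGEX="\\b($(paste -sd'|' "$1"))\\b"
# Delete every whole-word match from the text file $2
perl -pe "s/$MYREGEX//g" "$2"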

Suppose you have saved your stop word list above as stopwords.txt, and you have a second file, say testtext.txt, that contains:

This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.

Then running the script at the command line as follows will remove the stop words:

KBs-MBP13:temp kbenoit$ ./replacesw.sh stopwords.txt testtext.txt 
This is  file with  stopwords from  stopwords.txt  testing.
More than one line in  file,   better test.

You might need to run chmod u+x replacesw.sh first to make the script executable.
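Since the question also mentions awk, here is a minimal sketch of an awk-only variant. It is an alternative technique, not the Perl method above: it loads the stop words into an array and drops every whitespace-separated field that matches one exactly. The function name remove_stopwords is hypothetical. The sketch assumes stopwords.txt is non-empty and has one word per line; matching is case-sensitive; a word with punctuation attached (such as "the,") is left alone; and runs of whitespace are collapsed to single spaces. I have not timed it on a 2GB file.

```shell
#!/bin/bash
# Hypothetical helper: remove_stopwords STOPWORDS_FILE TEXT_FILE
# Assumes STOPWORDS_FILE is non-empty (the NR == FNR idiom relies on this).
remove_stopwords() {
    awk '
        # First file: remember each stop word as an array key.
        NR == FNR { stop[$0] = 1; next }
        # Second file: rebuild each line from the fields that are not stop words.
        {
            out = ""
            sep = ""
            for (i = 1; i <= NF; i++) {
                if (!($i in stop)) {
                    out = out sep $i
                    sep = " "
                }
            }
            print out
        }
    ' "$1" "$2"
}
```

After sourcing the function, a call such as remove_stopwords stopwords.txt testtext.txt > cleaned.txt writes the filtered text to cleaned.txt (a hypothetical output name). Because each word is checked with an array lookup rather than against a long alternation, this may hold up better as the stop word list grows, but that is worth measuring on your own data.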

Upvotes: 2
