Reputation: 3060
I have a 2 GB text file, and I am trying to remove frequently occurring English stop words from it.
I have a stopwords.txt file with one word per line, like this:
a
an
the
for
and
I
What is the fastest way to do this using shell commands such as tr, sed, or awk?
Upvotes: 3
Views: 1022
Reputation: 14902
Here's a method using the command line and perl. Save the text below as replacesw.sh:
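#!/bin/bash
# Join the stop words in $1 with "|", then drop the trailing "|" left by the final newline
WORDS=$(perl -pe 's/\n/|/g' "$1")
MYREGEX="\\b(${WORDS%|})\\b"
# Delete every whole-word match from the text in $2
perl -pe "s/$MYREGEX//g" "$2"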
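To make the pattern concrete: the script joins the stop words with | and wraps the result in \b( ... )\b so that only whole words match. Here is a minimal sketch that prints such a pattern for the six sample words. It assumes paste is available and uses it for the join instead of the perl one-liner; build_stopword_regex is a hypothetical helper name, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: print a \b(w1|w2|...)\b pattern from a one-word-per-line file.
build_stopword_regex() {
    # paste -s joins all lines of the file, -d '|' sets the separator
    printf '\\b(%s)\\b\n' "$(paste -sd '|' "$1")"
}

# Demo with the sample stop words from the question
tmp=$(mktemp)
printf '%s\n' a an the for and I > "$tmp"
build_stopword_regex "$tmp"   # prints: \b(a|an|the|for|and|I)\b
rm -f "$tmp"
```

The words go into the pattern unescaped, so this only holds up if the stop word list contains plain words with no regex metacharacters.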
Then, if you have saved your stop word list as stopwords.txt and have a second file called (e.g.) testtext.txt that contains:
This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.
Then the following at the command line will remove the stopwords:
KBs-MBP13:temp kbenoit$ ./replacesw.sh stopwords.txt testtext.txt
This is file with stopwords from stopwords.txt testing.
More than one line in file, better test.
You might need to run chmod u+x replacesw.sh first.
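Since the question also mentions awk, here is an alternative sketch (not the perl method above) that loads the stop words into an awk array and filters each line field by field. remove_stopwords is a hypothetical helper name. The sketch assumes a non-empty stop word file, compares whitespace-separated tokens exactly (so "the," with attached punctuation is kept, and matching is case-sensitive), and rewrites runs of whitespace as single spaces.

```shell
#!/bin/sh
# Hypothetical helper: remove_stopwords STOPFILE TEXTFILE
remove_stopwords() {
    awk '
        # First file: remember every stop word as an array key
        NR == FNR { stop[$0] = 1; next }
        # Second file: rebuild each line from the fields that are not stop words
        {
            out = ""
            for (i = 1; i <= NF; i++)
                if (!($i in stop))
                    out = out (out == "" ? "" : " ") $i
            print out
        }
    ' "$1" "$2"
}
```

Usage would be remove_stopwords stopwords.txt testtext.txt > cleaned.txt. The array lookup costs about the same per token regardless of how many stop words there are, which may matter for a long list, but I have not benchmarked it against the regex version on a 2 GB file.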
Upvotes: 2