Reputation: 2446

Using sed to remove words in a stopword list (Feeding sed a list of parameters to remove from a text file)

We all know that sed is great at finding and replacing every occurrence of a word in a file:

sed -i 's/original_word/new_word/g' file.txt

But can someone show me how to feed sed a list of 'original_words' from a file (similar to grep -f)? I just want to replace each of them with '' (that is, erase them).

The wordlist file is just a list of stopwords, one per line (wordlist.txt):

a
about
above
according
across
after
afterwards

This would be an easy way to take a list of stopwords and nuke them from a corpus (useful for cleaning data).

file.txt looks like this:

05ricardo   RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3

Upvotes: 3

Answers (5)

Thor

Reputation: 47189

You could also let sed write the sed script for you (tested with GNU sed). The g flag has to end up on the generated commands, so that every occurrence on a line is removed, not just the first:

<stopwords sed 's:.*:s/\\b&\\b//g:' | sed -f - file.txt

Output:

05ricardo   RT @shakira: Immigration reform isn't  politics. It's  people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me  copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
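It can help to look at the generated script before applying it. The sketch below wraps the generating step in a function; the name gen_script is hypothetical, and the g flag is placed on the generated commands so each one removes every occurrence on a line. Generating the script is plain sed; applying it relies on GNU sed for \b.

```shell
# Hypothetical helper: print the sed script generated from a stopword file.
# $1 = stopword file, one word per line.
gen_script() {
  # Each input line W becomes the command  s/\bW\b//g
  sed 's:.*:s/\\b&\\b//g:' "$1"
}

# Usage (not run here):
#   gen_script stopwords                       # inspect the commands
#   gen_script stopwords | sed -f - file.txt   # apply them (GNU sed)
```

For a stopword file containing "a" and "about", the function prints s/\ba\b//g and s/\babout\b//g on separate lines.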

Upvotes: 2

alemol

Reputation: 8652

cat file.txt | grep  -vf wordlist.txt

Upvotes: -1

William Pursell

Reputation: 212474

First, not every sed supports -i, but the option is not necessary, since it is trivial to provide that functionality in a general way. One simple option (assuming a shell outside the csh family):

inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }

Then, to do the replacements (you haven't specified how you want to deal with word delimiters, so if "foo" is in the blacklist, "bar foo baz" will end up with two spaces between "bar" and "baz"), it is pretty straightforward with either awk or perl:

awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
perl -wne 'if( @ARGV ){ chomp; push @no, $_; next }
    foreach $x( @no ) { s/$x//g } print' original-words file.txt

If you are happy with the results, either use -i with perl (not every sed supports -i, but every perl newer than 5.0 does) or modify the file with:

inline file.txt awk 'NR==FNR{a[$0]; next} 
    {for( i in a ) gsub( i, "" )} 1' original-words -

Either of these solutions will be substantially faster than invoking sed for every word in the blacklist.
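On the word-delimiter point: one way to avoid both the leftover double spaces and accidental substring matches is to compare whole whitespace-delimited tokens against the list. This is only a sketch under stated assumptions. The function name strip_stopword_fields is hypothetical, a token with punctuation attached (such as "about.") is not treated as a stopword, and every run of whitespace, including tabs between columns, is collapsed to a single space.

```shell
# Hypothetical helper: drop tokens that exactly match a stopword.
# $1 = stopword file (one word per line), $2 = text file
strip_stopword_fields() {
  awk '
    NR == FNR { stop[$0]; next }   # first file: remember each stopword
    {
      out = ""; sep = ""
      for (i = 1; i <= NF; i++) {
        if (!($i in stop)) {       # keep tokens that are not stopwords
          out = out sep $i
          sep = " "
        }
      }
      print out                    # fields rejoined with single spaces
    }
  ' "$1" "$2"
}
```

With "a" and "about" in the list, the line "bar about baz a qux" becomes "bar baz qux". To rewrite the file in place, the output can be redirected to a temporary file and moved over the original.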

Upvotes: 1

Steve

Reputation: 54532

Here's one way using GNU sed:

while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist

Contents of file:

how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I 
think I'd rather go up and above.

Results:

how I decide to look at it. What
 do you think? Is it good idea to go out and? I 
think I'd rather go up and.
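The loop above rewrites the file once per stopword. A possible variation, sketched here rather than taken from the answer, joins the list into a single alternation so that sed runs only once. The function name strip_stopwords_once is hypothetical, GNU sed is assumed (for \b and the empty alternative in "( |)"), and the stopwords are assumed to contain no regex metacharacters or slashes.

```shell
# Hypothetical helper: remove every stopword in one sed invocation (GNU sed).
# $1 = stopword file (one word per line), $2 = text file; result goes to stdout.
strip_stopwords_once() {
  # paste joins the lines with "|", giving an alternation like a|about|above
  pattern=$(paste -sd '|' "$1")
  # Same expression as the loop, with the single word replaced by the group
  sed -E "s/( |)\b($pattern)\b//g" "$2"
}
```

With the seven-word list from the question, the first line of the sample file comes out as "how I decide to look at it. What", the same as in the results above.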

Upvotes: 1

Zombo

Reputation: 1

Perhaps this

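#!/bin/sh
# Remove each word listed in dict.txt from file.txt, one sed pass per word.
# -i and \b (word boundary) are GNU sed extensions; \b keeps "a" from being
# stripped out of the middle of other words.
while IFS= read -r k
do
  sed -i "s/\b$k\b//g" file.txt
done < dict.txt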

Upvotes: 0
