Reputation: 2446
So, we all know that sed is great at finding and replacing all occurrences of words in a file:
sed -i 's/original_word/new_word/g' file.txt
But can someone show me how to feed sed a list of 'original_words' from a file (similar to grep -f)? I just want to replace each of them with '' (that is, erase them).
The wordlist file is just a list of stopwords, one per line (wordlist.txt):
a
about
above
according
across
after
afterwards
This would be an easy way to take a list of stopwords and nuke them from a corpus (useful for cleaning data).
file.txt looks like this:
05ricardo RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ... 0
05ricardo ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol 0
05ricardo ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ? -1
05rodriguez_a My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
Upvotes: 3
Views: 3709
Reputation: 47189
You could also let sed write the sed-script for you (tested with GNU sed):
<stopwords sed 's:.*:s/\\b&\\b//:g' | sed -f - file.txt
Output:
05ricardo RT @shakira: Immigration reform isn't politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ... 0
05ricardo ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol 0
05ricardo ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ? -1
05rodriguez_a My Comm teacher gave me copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
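Note that the generated commands in the one-liner above carry no g flag (the g applies to the generator itself), which is why only the first "about" on the first line was removed. A minimal sketch of a variant that puts g on each generated command is below. It assumes GNU sed (for \b and for reading the script with -f -), and the temporary directory and sample text are made up for the demonstration:

```shell
# Hypothetical demo files in a temporary directory (not from the question).
tmp=$(mktemp -d)
printf '%s\n' a about above > "$tmp/stopwords"
printf '%s\n' "It isn't about politics. It's about people." > "$tmp/file.txt"

# Generate one substitution per stopword, each with its own g flag.
# For the three words above, the generated script is:
#   s/\ba\b//g
#   s/\babout\b//g
#   s/\babove\b//g
sed 's:.*:s/\\b&\\b//g:' "$tmp/stopwords"

# Feed the generated script to a second sed (GNU sed reads it from stdin).
result=$(sed 's:.*:s/\\b&\\b//g:' "$tmp/stopwords" | sed -f - "$tmp/file.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Both occurrences of "about" are removed here, leaving a double space where each word used to be. Stopwords containing regex metacharacters or a slash would still need escaping before being turned into sed commands.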
Upvotes: 2
Reputation: 212474
First, not all sed implementations support -i, but it's not a necessary option, as it is trivial to provide that functionality in a general way. One simple option (assuming a non-csh family shell):
inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }
The replacements themselves are straightforward with either awk or perl. You haven't specified how you want to deal with word delimiters, so if "foo" is in the blacklist, "bar foo baz" will end up with two spaces between "bar" and "baz":
awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
perl -wne 'if( $ARGV = $ARGV[0] ){ chomp; push @no, $_; next }
foreach $x( @no ) { s/$x//g } print' original-words file.txt
If you are happy with the results, either use -i with perl (not all sed implementations support -i, but every perl newer than 5.0 does), or modify the file with:
inline file.txt awk 'NR==FNR{a[$0]; next}
{for( i in a ) gsub( i, "" )} 1' original-words -
Either of these solutions will be substantially faster than invoking sed for every word in the blacklist.
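One caveat with gsub( i, "" ): it treats each blacklist entry as a regular expression and matches it anywhere, so a one-letter stopword such as "a" would strip that letter out of every word. A rough sketch of a whole-token variant in portable awk is below. It is a swapped-in technique, not the command above: it compares whitespace-separated fields against the list, so a token with attached punctuation ("about,") is left alone, and runs of whitespace collapse to single spaces. The temporary directory and sample text are hypothetical:

```shell
# Hypothetical demo files in a temporary directory (not from the question).
tmp=$(mktemp -d)
printf '%s\n' a about > "$tmp/original-words"
printf '%s\n' 'how about a walk, about now' > "$tmp/file.txt"

result=$(awk '
    # First file: remember every stopword as an array key.
    NR == FNR { stop[$0]; next }
    # Second file: rebuild each line from the fields that are not stopwords.
    {
        out = ""
        for (i = 1; i <= NF; i++)
            if (!($i in stop))
                out = (out == "" ? $i : out " " $i)
        print out
    }' "$tmp/original-words" "$tmp/file.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because the lookup is an array membership test rather than a regex match, stopwords containing characters like "." or "*" need no escaping in this form.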
Upvotes: 1
Reputation: 54532
Here's one way using GNU sed:
while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist
Contents of file:
how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I
think I'd rather go up and above.
Results:
how I decide to look at it. What
do you think? Is it good idea to go out and? I
think I'd rather go up and.
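The loop above rewrites the file once per word. If that becomes slow, one possible alternative (a different technique, sketched here under the assumption of GNU sed and a wordlist free of regex metacharacters and slashes) is to join the words into a single alternation and run sed once. The temporary directory, file names, and sample text are made up for the demonstration:

```shell
# Hypothetical demo files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' a about afterwards > "$tmp/wordlist"
printf '%s\n' 'how about I decide to look at it afterwards.' > "$tmp/file"

# Join the lines of the wordlist with "|" to form a|about|afterwards.
pattern=$(paste -sd'|' "$tmp/wordlist")

# One pass over the file: delete any listed word that stands alone (\b is a GNU extension).
result=$(sed -E "s/\b($pattern)\b//g" "$tmp/file")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Unlike the loop, this sketch does not swallow the space before each removed word, and a very long wordlist could exceed the shell's argument-length limit, in which case generating a script file for sed -f is the safer route.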
Upvotes: 1
Reputation: 1
Perhaps this:
#!/bin/sh
while IFS= read -r k
do
sed -i "s/$k//g" file.txt
done < dict.txt
Upvotes: 0