Reputation: 32306
I have a text file that I need to correct. Lines containing any word listed in the file "exclude.txt" should be removed from the original text.
original.txt
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
The exclude file looks like this...
exclude.txt
tart
wrok
The expected output will look like this...
final.txt
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
This grep command is working as expected.
grep -v -E 'tart|wrok' original.txt
This is fine when the exclude file has only 2 or 3 words, but the problem is that both the original and exclude files contain millions of words.
Update:
I forgot to mention that I have this line in original.txt
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
I want to keep this line in the original file because, even though the wrong word "tart" is there, it is not in "block-list:name".
Update:
The include file has 15 million words, whereas the exclude file has 15 thousand.
include.txt
test
work
table
total
exit
The awk and grep + sed commands get killed. I would prefer to use the include file instead of the exclude file (if possible).
Upvotes: 1
Views: 114
Reputation: 784998
You may use this grep + sed solution in bash:
grep -vFf <(sed 's/.*/block-list:name="&"/' exclude.txt) original.txt
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
sed 's/.*/block-list:name="&"/' exclude.txt wraps each word in exclude.txt as block-list:name="<word>".
grep -vFf prints every line of original.txt that does not match any of the fixed-string patterns, which come from a process substitution <(...) that runs the sed command.
PS: Based on the edited question, this solution only drops lines containing block-list:name="blocked-word" in the original file.
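As a rough check of the behaviour described above, here is a small self-contained sketch. It assumes a scratch directory and a hypothetical intermediate file named patterns.txt, used in place of process substitution so it also runs in shells that lack that feature. The sample data is a shortened version of the question's, including the line from the update that should be kept.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > original.txt <<'EOF'
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
EOF

printf '%s\n' tart wrok > exclude.txt

# Wrap each excluded word as block-list:name="word".
# patterns.txt is a hypothetical file name chosen for this sketch.
sed 's/.*/block-list:name="&"/' exclude.txt > patterns.txt

# -F: treat patterns as fixed strings, -v: invert the match,
# -f: read the patterns from a file.
grep -vFf patterns.txt original.txt > final.txt
cat final.txt
```

The line with abbreviated-name="tart" survives because the pattern block-list:name="tart" does not occur anywhere in it.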
Upvotes: 1
Reputation: 37394
Using awk with " as the field delimiter, so that every even-numbered field is a quoted word (blabla"word"blalbla"another_word"...):
$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original
Output:
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
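To see why $4 is the field of interest, the following minimal sketch splits one sample line from the question on the double quote and prints the second and fourth fields. The variable names line and fields are only illustrative.

```shell
line='<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>'

# With " as the separator, $2 and $4 are the two quoted attribute values.
fields=$(printf '%s\n' "$line" | awk -F'"' '{ print "$2=" $2; print "$4=" $4 }')
printf '%s\n' "$fields"
```

This prints $2=tast followed by $4=tart, so testing only $4 restricts the comparison to the block-list:name value, as long as that attribute always comes second.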
Edit: Just noticed you want to compare words only in "block-list:name". The field position matters in the commands, so I changed !($2 in a)&&!($4 in a) to !($4 in a). If the position of block-list:name varies, use:
$ awk '
NR==FNR {                              # process the exclude file
    a[$1]                              # hash word
    next
}
{                                      # process the original file
    for(i=1;i<=NF;i++)                 # loop over every space-separated string
        if($i~/^block-list:name=/) {   # when we meet the desired string
            t=$i                       # copy string to temp var
            gsub(/^[^"]+"|".*/,"",t)   # extract the word
            if(!(t in a))              # if the word is not to be excluded
                print                  # output record
            next                       # move to the next record anyway
        }
}' exclude original
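Since the question mentions a preference for the include file, the same idea can be inverted. The following is only a sketch under two assumptions: block-list:name is always the fourth quote-delimited field, and the machine has enough memory for awk to hold every word of include.txt in an array, which may be substantial with millions of entries. The array name keep is illustrative.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > original.txt <<'EOF'
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
EOF

printf '%s\n' test work > include.txt

# First file: remember every allowed word.
# Second file: print a line only when its fourth quote-delimited field
# (assumed to be the block-list:name value) is an allowed word.
awk -F'"' 'NR==FNR { keep[$1]; next } ($4 in keep)' include.txt original.txt > final.txt
cat final.txt
```

Here the lines whose block-list:name is tart or wrok are dropped, while the line with abbreviated-name="tart" and name "test" is kept.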
Upvotes: 1