exclude regular expression and process very large files

Question

I have a text file that I need to correct. The words found in the file "exclude.txt" should be removed from the original text.

original.txt

The exclude file looks like this...

exclude.txt
tart
wrok

The expected output will look like this...

final.txt

This grep command is working as expected.

grep -v -E 'tart|wrok' original.txt

This is OK if I have only 2 or 3 words in the exclude file. The problem is that both the original and the exclude files contain millions of words.

Update:

I forgot to mention that I have this line in original.txt

I want to keep this line in the original file: although it contains the wrong word "tart", that word is not in "block-list:name".

Update:

The include file has 15 million words, compared to 15 thousand in the exclude file.

include.txt
test
work
table
total
exit

The awk and grep + sed commands get killed. I would prefer to use the include file instead of the exclude file (if possible).

James Brown · Accepted Answer

Using awk with " as the field delimiter, so that every even-numbered field is a quoted word (blabla"word"blablabla"another_word"...):

$ awk -F\" 'NR==FNR{a[$1];next}!($4 in a)' exclude original
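To see why $4 is the field being tested, it may help to look at how the " delimiter splits a record. The sketch below uses a hypothetical sample line (the real layout of original.txt may differ) and prints each field with its index. The names line and fields are illustrative only.

```shell
# Hypothetical sample record; the real layout of original.txt may differ.
line='<entry string="tart" block-list:name="wrok"/>'

# Split the record on the double quote and print every field with its index.
fields=$(printf '%s\n' "$line" |
    awk -F\" '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

With a record shaped like this, field 2 holds the first quoted word and field 4 holds the word in block-list:name, which is the one the command above looks up in the hash.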

Output:

Edit: I just noticed the requirement to compare words only in "block-list:name". As the comments point out, the position of the word matters, so I changed !($2 in a)&&!($4 in a) to !($4 in a). If the placement of block-list:name varies, use:

$ awk '
NR==FNR {                             # process the exclude file
    a[$1]                             # hash word
    next
}
{                                     # process the original file
    for(i=1;i<=NF;i++)                # loop over every space-separated string
        if($i~/^block-list:name=/) {  # when we meet the desired string
            t=$i                      # copy string to temp var
            gsub(/^[^"]+"|".*/,"",t)  # extract the word
            if(!(t in a))             # if the word is not to be excluded
                print                 # output record
            next                      # move the next record anyway
        }
}' exclude original
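The question also asks whether the include file could be used instead of the exclude file. A minimal sketch of that variant, assuming the same record layout as above, flips the test so that a record is printed only when its block-list:name word is in the include list. The sample records, the scratch directory and the names ok and kept are hypothetical. Hashing 15 million words needs a fair amount of memory, so this may or may not fit on the machine in question.

```shell
# Work in a scratch directory so the sketch does not touch real files.
tmp=$(mktemp -d)

# Hypothetical include list (words that are allowed).
printf '%s\n' test work table > "$tmp/include.txt"

# Hypothetical records; the real layout of original.txt may differ.
cat > "$tmp/original.txt" <<'EOF'
<entry string="test" block-list:name="test"/>
<entry string="tart" block-list:name="tart"/>
<entry string="tart" block-list:name="work"/>
EOF

kept=$(awk '
NR==FNR {                               # first file: the include list
    ok[$1]                              # hash each allowed word
    next
}
{                                       # second file: the records
    for (i = 1; i <= NF; i++)
        if ($i ~ /^block-list:name=/) { # the attribute we care about
            t = $i
            gsub(/^[^"]+"|".*/, "", t)  # strip everything around the word
            if (t in ok)                # keep only allowed words
                print
            next                        # done with this record either way
        }
}' "$tmp/include.txt" "$tmp/original.txt")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

As in the exclude version, records that have no block-list:name attribute are not printed. The third sample record is kept even though "tart" appears elsewhere on the line, because only the word in block-list:name is checked.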
