john77222

Reputation: 21

How to remove lines from a text file that match words in a list?

file1: word_list.txt (over 1,000,000 lines)

file2: list.txt (over 1,000,000 lines)

I have a file containing a list of words. I want to remove all occurrences of all the words in this file from a big text file.

Example:

File 1 (word_list.txt)

111
222

File 2 (list.txt), sample

111
222
333
444
555

Output

333
444
555

This code is very slow for large files with over 1 million lines:

sed -e "$(sed 's:.*:s/&//ig:' word_list.txt)" list.txt

What is the most appropriate method for this problem?

Upvotes: 2

Views: 216

Answers (1)

karakfa

Reputation: 67467

Assumptions: each file has one word per line, words are unique within each file, and both files are sorted (or can be sorted).

$ comm -13 file1 file2

333
444
555

-1   suppress lines unique to file1
-3   suppress lines that appear in both files 

This prints the words in file2 that are not in file1, that is, the set difference file2 \ file1.

This should be the fastest approach. Please post the timings if you can test alternative solutions.
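
If the files are not already sorted, they need to be sorted before comm can compare them. A minimal sketch, assuming a temporary directory and the hypothetical *.sorted file names are acceptable (neither comes from the question); the result comes out in sorted order, not in the original order of list.txt:

```shell
# Illustrative sample data in a scratch directory (names are assumptions).
tmpdir=$(mktemp -d)
printf '%s\n' 222 111 > "$tmpdir/word_list.txt"
printf '%s\n' 555 111 333 222 444 > "$tmpdir/list.txt"

# Sort both inputs; -u also drops duplicate lines.
# LC_ALL=C keeps sort and comm on the same byte-wise ordering.
LC_ALL=C sort -u "$tmpdir/word_list.txt" > "$tmpdir/word_list.sorted"
LC_ALL=C sort -u "$tmpdir/list.txt" > "$tmpdir/list.sorted"

# -13 keeps only the lines unique to the second file.
result=$(LC_ALL=C comm -13 "$tmpdir/word_list.sorted" "$tmpdir/list.sorted")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Using the same locale for sort and comm matters, because comm expects its inputs in the collation order it uses itself.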

Alternatively,

$ awk 'NR==FNR{a[$0]; next} !($0 in a)' file1 file2

should work as long as the words from file1 fit in memory. It doesn't require sorting, and it keeps the lines of file2 in their original order.
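
To see how the awk version behaves on unsorted input, here is a small sketch on made-up sample data (the scratch directory and the array name seen are illustrative). The first file is loaded into an array as keys; each line of the second file is printed only if it is not a key:

```shell
# Illustrative sample data; list.txt is unsorted and repeats 333.
tmpdir=$(mktemp -d)
printf '%s\n' 111 222 > "$tmpdir/word_list.txt"
printf '%s\n' 555 111 333 222 444 333 > "$tmpdir/list.txt"

# NR==FNR is true only while reading the first file: store its lines as keys.
# For the second file, print lines whose full text is not a stored key.
result=$(awk 'NR==FNR { seen[$0]; next } !($0 in seen)' \
    "$tmpdir/word_list.txt" "$tmpdir/list.txt")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Unlike the comm approach, duplicate lines in the second file are kept and the match is on the whole line, so a word that only appears as part of a longer line is not removed.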

Upvotes: 1
