Reputation: 21
file1: word_list.txt (over 1,000,000 lines)
file2: list.txt (over 1,000,000 lines)
I have a file containing a list of words. I want to remove all occurrences of those words from a big text file.
Example:
File 1 (word_list.txt):
111
222
Text file sample (list.txt):
111
222
333
444
555
Output:
333
444
555
This code is very slow for large files with over 1 million lines:
sed -e "$(sed 's:.*:s/&//ig:' word_list.txt)" list.txt
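To see why this is slow: the inner sed turns every word in word_list.txt into its own `s/word//ig` substitution command, and the outer sed then applies all of those commands to every line of list.txt, so the work grows with (number of words) x (number of lines). A small illustration of the generated script, using the sample words from the question:

```shell
# Sample word list from the question.
printf '111\n222\n' > word_list.txt

# The inner sed rewrites each word W into the sed command 's/W//ig',
# producing one substitution command per word.
sed 's:.*:s/&//ig:' word_list.txt
# prints:
# s/111//ig
# s/222//ig
```

With a million words, that is a million substitution commands run against each of a million lines, which explains the poor performance.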
What is the most appropriate method for this problem?
Upvotes: 2
Views: 216
Reputation: 67467
Assumptions: the files have one word per line, words are unique within each file, and the files can be sorted (or are already in sorted order).
$ comm -13 file1 file2
333
444
555
-1 suppresses lines unique to file1
-3 suppresses lines that appear in both files
which gives you the lines of file2 that do not appear in file1 (that is, the set difference file2 \ file1).
This should be the fastest approach. Please post the timings if you can test alternative solutions.
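If the files are not already sorted, a minimal sketch using bash process substitution (assuming bash and the sample filenames from the question; comm requires sorted input):

```shell
# Sample data from the question (assumption: one word per line).
printf '111\n222\n' > word_list.txt
printf '111\n222\n333\n444\n555\n' > list.txt

# Sort both files on the fly and let comm compute the set difference.
# Process substitution <(...) avoids temporary files (bash-specific).
comm -13 <(sort word_list.txt) <(sort list.txt)
# prints:
# 333
# 444
# 555
```

Note that sorting changes the line order of the output; if the original order of list.txt must be preserved, the awk approach below is more suitable.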
Alternatively,
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' file1 file2
should work as long as you have enough memory to hold file1 in the array. This doesn't require sorting and preserves the line order of file2.
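A quick demonstration on the sample data from the question (a sketch; NR==FNR is true only while awk is reading the first file, so its lines become keys of array a, and each line of the second file is printed only if it is not a key):

```shell
# Sample data from the question.
printf '111\n222\n' > word_list.txt
printf '111\n222\n333\n444\n555\n' > list.txt

# First file: store each line as an array key, then skip to the next line.
# Second file: print a line only if it is NOT a key in the array.
awk 'NR==FNR{a[$0]; next} !($0 in a)' word_list.txt list.txt
# prints:
# 333
# 444
# 555
```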
Upvotes: 1