Reputation: 21
I have two files with two single-column lists:
//file1 - full list of unique values
AAA
BBB
CCC
//file2
AAA
AAA
BBB
BBB
//So the result here would be:
CCC
I need to generate a list of values from file1 that have no matches in file2. I have to use a bash script (preferably without special tools like awk) or a DOS batch file.
Thank you.
Upvotes: 2
Views: 4280
Reputation: 57418
Looks like a job for grep's -v flag.
grep -v -F -f listtocheck uniques
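To make this concrete, here is a sketch using the sample data from the question, with file2 playing the role of listtocheck and file1 the role of uniques (the -x flag is an extra safeguard not in the answer above):

```shell
# Recreate the question's sample files.
printf 'AAA\nBBB\nCCC\n' > file1        # full list of unique values
printf 'AAA\nAAA\nBBB\nBBB\n' > file2   # list with duplicates

# -F: treat patterns as fixed strings, -f: read patterns from file2,
# -v: keep only non-matching lines. Adding -x restricts matching to
# whole lines, so a pattern like "AA" could not accidentally exclude "AAA".
grep -v -F -x -f file2 file1
# → CCC
```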
A variation on Drake Clarris's solution (which can be extended to check against several files, something grep can't do unless they are merged first) would be:
(
sort < file_to_check | uniq
cat reference_file reference_file
) | sort | uniq -u
By doing this, any word in file_to_check will appear exactly once in the output combined by the subshell in parentheses. Words in reference_file will be output at least twice, and words appearing in both files at least three times: once from the first file and twice from the two copies of the second. All that remains is to isolate the words we want, those that appear only once, which is what sort | uniq -u does.
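Applied to the question's sample data (file1 as file_to_check, file2 as reference_file), the counting argument plays out like this:

```shell
printf 'AAA\nBBB\nCCC\n' > file1        # file_to_check: the full list
printf 'AAA\nAAA\nBBB\nBBB\n' > file2   # reference_file

# AAA and BBB each reach the final sort 5 times (1 from file1 + 2x2
# copies of file2); CCC reaches it only once, so uniq -u keeps only CCC.
(
sort < file1 | uniq
cat file2 file2
) | sort | uniq -u
# → CCC
```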
If reference_file contains a lot of duplicates, it might be worthwhile to run the heavier
sort < reference_file | uniq
sort < reference_file | uniq
instead of cat reference_file reference_file, in order to produce a smaller output and lighten the load on the final sort.
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and in case of repeated checks with different files, we could reuse again and again the same sorted reference file without need of re-sorting it); therefore
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
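For completeness, here is the temporary-file variant run on the question's sample data (GNU sort's --merge option assumed, as in the snippet above):

```shell
printf 'AAA\nBBB\nCCC\n' > file1
printf 'AAA\nAAA\nBBB\nBBB\n' > file2

sort < file1 | uniq > .tmp.1   # sorted, deduplicated file_to_check
sort < file2 | uniq > .tmp.2   # sorted, deduplicated reference_file

# Merge the already-sorted files (.tmp.2 included twice) in linear time,
# then keep only the lines that occur exactly once.
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
# → CCC

rm -f .tmp.1 .tmp.2
```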
Finally, if one file contains very long runs of identical lines, which may be the case with some logging systems for example, it may also be worthwhile to run uniq twice: once to get rid of the runs (ahem) and once more to deduplicate after sorting, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1
Upvotes: 4
Reputation: 130919
For a Windows CMD solution (commonly referred to as DOS, but not really):
It should be as simple as
findstr /vlxg:"file2" "file1"
but there is a findstr bug that can cause matches to be missed when there are multiple literal search strings.
If a case-insensitive search is acceptable, then adding the /I option circumvents the bug.
findstr /vlixg:"file2" "file1"
If you are not restricted to native Windows commands, then you can download a utility like grep for Windows. The GNU utilities for Windows are a good source. Then you could use Isemi's solution on both Windows and 'nix.
It is also easy to write a VBScript or JScript solution for Windows.
Upvotes: 2