Reputation: 169

Find lines in a file that have only words from a list

Here is file1.txt:

.apple .ball .cow
.apple .cow .tea .mine.nice
.mine.nice
.tea
.zebra

Here is file2.txt:

.apple
.mine.nice
.cow
.tea

Expected Result:

.apple .cow .tea .mine.nice
.mine.nice
.tea

Using the following does not give the expected result:

grep -w -F -f file2.txt file1.txt

It gives:

.apple .ball .cow
.apple .cow .tea .mine.nice
.mine.nice
.tea

How can I get the expected result?

Upvotes: 2

Answers (3)

qi yuan

Reputation: 11

If you can accept the comm tool (it is a really simple tool), you can do it like this.

For each line in file1.txt, use comm with the -2 -3 options to get the words that exist only in that line and not in file2.txt. If the output is empty, every word of the line is in the list, so print the line.

cat file1.txt | xargs -I {} bash -c 'if [[ -z $(comm -2 -3 <(echo {} | sed "s/\s\+/\n/g" | sort) <(sort file2.txt)) ]]; then echo {}; fi'
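To make the comm step easier to see, here is a minimal sketch that runs it on two sample lines using temporary files instead of process substitution. The helper name leftover and the scratch file names are made up for this illustration, and it assumes words are separated by single spaces.

```shell
export LC_ALL=C                      # same collation for sort and comm
tmp=$(mktemp -d)                     # scratch directory for the sorted lists
printf '%s\n' .apple .mine.nice .cow .tea | sort > "$tmp/allowed.sorted"

# Print the words of a line that are NOT in the allowed list (hypothetical helper).
leftover() {
    printf '%s\n' "$1" | tr -s ' ' '\n' | sort > "$tmp/words.sorted"
    comm -23 "$tmp/words.sorted" "$tmp/allowed.sorted"
}

rejected=$(leftover '.apple .ball .cow')            # .ball is not in the list
accepted=$(leftover '.apple .cow .tea .mine.nice')  # every word is in the list

echo "rejected leftover: [$rejected]"
echo "accepted leftover: [$accepted]"
rm -rf "$tmp"
```

A line is kept exactly when the leftover is empty, which is what the -z test in the one-liner checks.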

Upvotes: 0

potong

Reputation: 58473

This might work for you (GNU sed):

sed -En '1{x;s/.*/cat file2/e;y/\n/ /;s/$/ /;x}
         s/.*/& \n&/;G
         :a;s/^(\S+ )(.*\n.*\n.*\1)/\2/;ta;s/^\n(.*)\n.*/\1/p' file1

The solution juggles three lines in the pattern space: two copies of the current line and the contents of file2. The first copy of the current line is matched against the strings in file2 and reduced in size until there are no more matches. If the matching leaves the first copy empty, every word matched and the line is printed; otherwise it is discarded. The flow of processing is as follows:

Prime the hold space with the contents of file2, replace newlines by spaces and append a space for pattern matching purposes.

Double the current line, again adding a space to the first copy, separate the copies by newlines and append the hold space.

Iterate through the strings at the front of the first copy of the current line, removing each one if it matches in file2.

When there are no more matches, if all that is left is the newline separating the copies then print the unadulterated copy of the current line.

Otherwise the current line did not match the strings in file2 and no output is produced for that line.
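The same "strip the leading word while it is in the list" idea can be sketched as a plain shell loop. This is not the sed command above, only an illustration of the flow. The function name only_allowed and the inlined list are assumptions made for the example. It pads the list with a space on both sides so that only whole words match, and it assumes single spaces between words.

```shell
# Contents of file2 joined by spaces, with a space on each side (inlined here).
allowed=' .apple .mine.nice .cow .tea '

# Succeed if every word of the line appears in $allowed (hypothetical helper).
only_allowed() {
    rest="$1 "                       # working copy with a trailing space
    while [ -n "$rest" ]; do
        word=${rest%% *}             # first word of the working copy
        case $allowed in
            *" $word "*) rest=${rest#* } ;;   # in the list: drop it, continue
            *) return 1 ;;                    # not in the list: reject the line
        esac
    done
    return 0                         # nothing left: every word matched
}

kept=$(
    for line in '.apple .ball .cow' '.apple .cow .tea .mine.nice' \
                '.mine.nice' '.tea' '.zebra'; do
        if only_allowed "$line"; then printf '%s\n' "$line"; fi
    done
)
printf '%s\n' "$kept"
```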

Upvotes: 1

Daweo

Reputation: 36630

I would use the next statement of GNU AWK for this task in the following way. Let file1.txt content be

.apple .ball .cow
.apple .cow .tea .mine.nice
.mine.nice
.tea
.zebra

and file2.txt content be

.apple
.mine.nice
.cow
.tea

then

awk 'NR==FNR{arr[$1];next}{for(i=1;i<=NF;i+=1){if(!($i in arr)){next}};print}' file2.txt file1.txt

gives

.apple .cow .tea .mine.nice
.mine.nice
.tea

Explanation: while processing the 1st of the listed files (note that this is file2.txt), i.e. while the overall row number equals the row number within the current file (NR==FNR), reference the 1st field as a key of the array arr. This creates the key in the array; I do not assign any value, as it is irrelevant later. After doing that, go to the next line, i.e. do nothing else while processing the 1st file. For every line of the 2nd file, iterate over the fields using a for loop; if you encounter a field which is not one of the keys of the array arr, go to the next line; after processing all fields, print the whole line as is. Note that this code short-circuits, i.e. it goes to the next line as soon as the 1st disallowed word is detected. Disclaimer: I assume that file2.txt holds exactly 1 word per line.

(tested in gawk 4.2.1)
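If the word list may hold several words per line, one possible variation is to store every field of the first file rather than only $1. The sketch below is an assumption-laden illustration, not part of the tested command above: the temporary file names list.txt and input.txt and the array name allowed are invented for the example.

```shell
tmp=$(mktemp -d)
# Word list with two words per line (sample data for the illustration).
printf '%s\n' '.apple .mine.nice' '.cow .tea' > "$tmp/list.txt"
printf '%s\n' '.apple .ball .cow' '.apple .cow .tea .mine.nice' \
              '.mine.nice' '.tea' '.zebra' > "$tmp/input.txt"

result=$(awk '
    # First file: remember every field as an allowed word.
    NR == FNR { for (i = 1; i <= NF; i++) allowed[$i]; next }
    # Second file: skip the line at the first field that is not allowed.
    { for (i = 1; i <= NF; i++) if (!($i in allowed)) next; print }
' "$tmp/list.txt" "$tmp/input.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```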

Upvotes: 2
