Test if each line in a file contains one of multiple strings in another file

Question

I have a text file (we'll call it keywords.txt) that contains a number of strings that are separated by newlines (though this isn't set in stone; I can separate them with spaces, commas or whatever is most appropriate). I also have a number of other text files (which I will collectively call input.txt).

What I want to do is iterate through each line in input.txt and test whether that line contains one of the keywords. After that, depending on what input file I'm working on at the time, I would need to either copy matching lines in input.txt into output.txt and ignore non-matching lines or copy non-matching lines and ignore matching.

I searched for a solution and found ways to do parts of this, but nothing that covers everything I'm asking for here. I could try to combine the various solutions I found, but my main concern is that I would be left wondering whether what I coded was the best way of doing it.

This is a snippet of what I currently have in keywords.txt:

google
adword
chromebook.com
cobrasearch.com
feedburner.com
doubleclick
foofle.com
froogle.com
gmail
keyhole.com
madewithcode.com

Here is an example of what can be found in one of my input.txt files:

&expandable_ad_
&forceadv=
&gerf=*&guro=
&gIncludeExternalAds=
&googleadword=
&img2_adv=
&jumpstartadformat=
&largead=
&maxads=
&pltype=adhost^

In this snippet, &googleadword= is the only line that would match the filter. Depending on the input file, output.txt should contain either only the matching lines or only the lines that don't match any keyword.

tobias · Accepted Answer

1. Assuming the content of keywords.txt is separated by newlines:

google
adword
chromebook.com
...

The following will work:

# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ff keywords.txt input.txt > output.txt

# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vFf keywords.txt input.txt > output.txt
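
As a quick check of the two commands above, here is a minimal sketch that rebuilds a few lines of the question's sample data in a throwaway directory. The output file names matching.txt and non_matching.txt are only illustrative and are not part of the original question.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three of the keywords from the question, one per line.
printf '%s\n' google adword chromebook.com > keywords.txt

# Four of the sample input lines from the question.
printf '%s\n' '&expandable_ad_' '&forceadv=' '&googleadword=' '&largead=' > input.txt

# Keep only the lines that contain at least one keyword.
grep -Ff keywords.txt input.txt > matching.txt

# Keep only the lines that contain none of the keywords.
grep -vFf keywords.txt input.txt > non_matching.txt

cat matching.txt
```

With this data, matching.txt should hold only &googleadword= and non_matching.txt the other three lines.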

2. Assuming the content of keywords.txt is separated by vertical bars:

google|adword|chromebook.com|...

The following will work:

# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -Ef keywords.txt input.txt > output.txt

# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vEf keywords.txt input.txt > output.txt
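
One thing to keep in mind with -E: the keywords are now regular expressions, so a dot matches any character rather than only a literal dot. The sketch below illustrates the difference with a made-up input line (chromebookXcom) and a hypothetical second pattern file, keywords_escaped.txt, in which the dot is escaped.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Keywords on a single line, separated by vertical bars.
printf '%s\n' 'google|adword|chromebook.com' > keywords.txt

# The second line is a made-up example that has no literal dot.
printf '%s\n' 'chromebook.com' 'chromebookXcom' > input.txt

# Unescaped dot: both lines match, because "." matches any character.
unescaped_count=$(grep -cEf keywords.txt input.txt)

# Escaped dot: only the line with a literal dot matches.
printf '%s\n' 'google|adword|chromebook\.com' > keywords_escaped.txt
escaped_count=$(grep -cEf keywords_escaped.txt input.txt)

echo "unescaped: $unescaped_count, escaped: $escaped_count"
```

If the keywords are meant as literal strings, the newline-separated form with -F from option 1 avoids the need for escaping.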

3. Assuming the content of keywords.txt is separated by commas:

google,adword,chromebook.com,...

There are many ways to do this, but a simple one is to use tr to replace every comma with a vertical bar and pass the result to grep as an extended regular expression. Quote the command substitution so that the shell does not word-split or glob-expand the pattern.

# Use keywords.txt as your pattern & copy matching lines in input.txt to output.txt
grep -E "$(tr ',' '|' < keywords.txt)" input.txt > output.txt

# Use keywords.txt as your pattern & copy non-matching lines in input.txt to output.txt
grep -vE "$(tr ',' '|' < keywords.txt)" input.txt > output.txt
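
An alternative sketch for the comma-separated case, assuming the keywords should be treated as literal strings: convert the commas to newlines and reuse the -F form from option 1. The intermediate file name keywords_lines.txt and the input line chromebookXcom are made up for illustration.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Keywords on a single line, separated by commas.
printf '%s\n' 'google,adword,chromebook.com' > keywords.txt

# Sample input; chromebookXcom is a made-up line with no literal dot.
printf '%s\n' '&googleadword=' 'chromebookXcom' '&largead=' > input.txt

# Turn each comma into a newline so there is one keyword per line.
tr ',' '\n' < keywords.txt > keywords_lines.txt

# Fixed-string matching: the dot in chromebook.com is taken literally.
grep -Ff keywords_lines.txt input.txt > output.txt

cat output.txt
```

Here only &googleadword= should end up in output.txt, since chromebookXcom does not contain the literal string chromebook.com.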

Grep Options

 -v, --invert-match
       Selected lines are those not matching any of the specified patterns.   

 -F, --fixed-strings
       Interpret each data-matching pattern as a list of fixed strings, 
       separated by newlines, instead of as a regular expression.

 -E, --extended-regexp
       Interpret pattern as an extended regular expression
       (i.e. force grep to behave as egrep).

 -f file, --file=file
       Read one or more newline separated patterns from file.
       Empty pattern lines match every input line.
       Newlines are not considered part of a pattern.
       If file is empty, nothing is matched.
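
The note about empty pattern lines is worth checking against your own keywords.txt: a stray blank line makes every input line match. A minimal sketch of the effect and one way to guard against it, assuming a hypothetical cleaned copy named keywords_clean.txt:

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A keyword file with an accidental blank line in the middle.
printf 'google\n\nadword\n' > keywords.txt

# Two of the sample input lines from the question.
printf '%s\n' '&forceadv=' '&googleadword=' > input.txt

# The blank pattern line matches everything, so both lines are counted.
with_blank=$(grep -cFf keywords.txt input.txt)

# Drop empty lines from the keyword file before using it.
grep -v '^$' keywords.txt > keywords_clean.txt
without_blank=$(grep -cFf keywords_clean.txt input.txt)

echo "with blank line: $with_blank, without: $without_blank"
```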
