Petra

Reputation: 1

grep: comparing a list of one or more words per line with a text file

I am working on Debian GNU/Linux and would like to use a short shell command (in the terminal or as an external script).

My aim: I have a list of words in foo.txt like

---- foo.txt ----

dog
cat
mouse with hat

---- /foo.txt ----

and want to compare this list with bar.txt (ordinary text with some paragraphs).

I would like to have two kinds of matches:

  1. all words of each line should match (e.g. 'mouse with hat' as well as just 'hat')

  2. only the first appearance of each whole line should match

Related to the first problem:

My first attempt (on the command line so far) and my problems:

for i in foo.txt; do fgrep -f foo.txt bar.txt

just matches the first word of the list. Now I think I have to use something like

for i in foo.txt; do fgrep -e <some-kind-of-regexp> -f foo.txt bar.txt

but I am bogged down with the regexp :(

Related to the second problem: for stopping grep I only know the -m option.

for i in foo.txt; do fgrep -m 1 -f foo.txt bar.txt

stops after the first line that matches anything. What I would like is something like 'find the first match of each entry and stop only after the whole list has been processed'.

Upvotes: 0

Views: 636

Answers (1)

Bacon

Reputation: 2195

For your first question, you need to split each line of the list into its individual words before you give it to grep. I use awk for this, but you could probably use sed too. I am splitting on whitespace, but you could just as easily split on non-alphanumerics if that's what you wanted:

fgrep -f <(mawk 'BEGIN{FS=" "}{print; if(NF > 1)for(i=1; i<=NF; i++)print $i}' foo.txt) bar.txt
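To see what that awk step actually hands to grep, here is a small sketch. It assumes the three-line foo.txt from the question, written to a temporary directory; the function name expand_patterns is made up for illustration, and plain awk is used instead of mawk so it runs wherever any awk is installed.

```shell
# Hypothetical sample data: the foo.txt from the question, in a temp directory
tmp=$(mktemp -d)
printf '%s\n' 'dog' 'cat' 'mouse with hat' > "$tmp/foo.txt"

# Print every line as-is; for multi-word lines also print each word separately
expand_patterns() {
    awk '{ print; if (NF > 1) for (i = 1; i <= NF; i++) print $i }' "$1"
}

patterns=$(expand_patterns "$tmp/foo.txt")
printf '%s\n' "$patterns"
```

This prints dog, cat, mouse with hat, mouse, with and hat, one per line. Since fgrep matches substrings, a pattern such as 'with' would also hit 'without'; adding the -w option to grep restricts matching to whole words if that is not wanted.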

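For your second question, you need to get a little fancy. First, output the line number along with each matched string, then unique on the matched string to get the first line number on which each string matched, and finally join those line numbers back against the numbered lines of bar.txt. Note that fgrep -n separates the line number from the match with a colon, so sort and cut need ':' as the field delimiter, and join needs both of its inputs sorted lexically on the line number. The final sort -n puts the result back into file order.

mawk '{print NR,$0}' bar.txt \
| sort -k1,1 \
| join -1 1 -2 1 - <(fgrep -o -n -f <(mawk 'BEGIN{FS=" "}{print; if(NF > 1)for(i=1; i<=NF; i++)print $i}' foo.txt) bar.txt \
| sort -t: -k2,2 -k1,1n \
| sort -t: -k2,2 -us \
| cut -d: -f1 \
| sort -u) \
| sort -n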
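If the pipeline feels heavy, the same idea can probably be done in a single awk pass. The following is only a sketch under a few assumptions: "first appearance" is taken to mean the first line of bar.txt containing each pattern as a fixed substring, the sample bar.txt is invented, and first_matches is a made-up function name.

```shell
# Hypothetical sample data in a temp directory
tmp=$(mktemp -d)
printf '%s\n' 'dog' 'cat' 'mouse with hat' > "$tmp/foo.txt"
printf '%s\n' 'a dog barks' 'the dog and the cat' 'a bird sings' \
    'a mouse with hat' 'another cat' > "$tmp/bar.txt"

# $1 = pattern list, $2 = text file
first_matches() {
    awk '
        # First file: store each non-empty line and each of its words as a pattern
        NR == FNR {
            if ($0 != "") pat[$0] = 1
            for (i = 1; i <= NF; i++) pat[$i] = 1
            next
        }
        # Second file: print a line once if it contains a pattern not seen before
        {
            hit = 0
            for (p in pat)
                if (!(p in seen) && index($0, p)) { seen[p] = 1; hit = 1 }
            if (hit) print FNR ": " $0
        }
    ' "$1" "$2"
}

result=$(first_matches "$tmp/foo.txt" "$tmp/bar.txt")
printf '%s\n' "$result"
```

With the sample data this prints lines 1, 2 and 4. Line 5 is skipped because 'cat' was already found on line 2. index() does a plain substring test, so it behaves like fgrep without -w.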

Upvotes: 1
