Reputation: 9417

Finding lines containing words that occur more than once using grep

How do I find all lines that contain duplicate lowercase words? I want to do this with egrep. This is what I've tried so far, but I keep getting "invalid back reference" errors:

egrep '\<(.)\>\1' inputFile.txt
egrep -w '\b(\w)\b\1' inputFile.txt

For example, if I have the following file:

The sky was grey. 
The fall term went on and on.
I hope every one has a very very happy holiday.
My heart is blue.
I like you too too too much
I love daisies.

It should find the following lines in the file:

The fall term went on and on.
I hope every one has a very very happy holiday.
I like you too too too much

It finds these lines because the words on, very and too occur more than once in each line.

Upvotes: 1

Answers (4)

Jotne

Reputation: 41456

I know this is about grep, but here is an awk solution.
It is more flexible, since you can easily change the test on the counter c:
c==2 the most repeated word occurs exactly twice
c>2 some word occurs more than twice
etc.

awk -F"[ \t.,]+" '{c=0;for (i=1;i<=NF;i++) if ($i!="") a[$i]++; for (i in a) c=c<a[i]?a[i]:c;delete a} c==2' file
The fall term went on and on.
I hope every one has a very very happy holiday.

It loops through all the words in a line and creates an array entry for each word, counting how often it occurs.
A second loop then finds the highest count, and the line is printed if that count passes the test on c.
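
As a rough sketch of how the threshold could be made adjustable, the same counting idea can be wrapped in a small shell function. The function name repeated_lines, the awk variable min and the temporary sample file are assumptions made for illustration, not part of the one-liner above. Like the original, it counts every word regardless of case.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
The sky was grey.
The fall term went on and on.
I hope every one has a very very happy holiday.
My heart is blue.
I like you too too too much
I love daisies.
EOF

# repeated_lines MIN FILE (hypothetical helper): print the lines of FILE
# in which some word occurs at least MIN times.
repeated_lines() {
    awk -v min="$1" -F'[ \t.,]+' '{
        c = 0
        for (i = 1; i <= NF; i++)
            if ($i != "") seen[$i]++                  # count each non-empty word
        for (w in seen)
            if (seen[w] > c) c = seen[w]              # highest count on this line
        split("", seen)                               # portable way to clear the array
    } c >= min' "$2"
}

repeated_lines 2 "$input"
```

Calling repeated_lines 3 "$input" instead should leave only the line in which a word occurs three times.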

Upvotes: 1

Avinash Raj

Reputation: 174706

This can be done with grep's -E or -P option.

grep -E '(\b[a-z]+\b).*\b\1\b' file

Example:

$ cat file
The fall term went on and on.
I hope every one has a very very happy holiday.
Hi foo bar.
$ grep -E '(\b[a-z]+\b).*\b\1\b' file
The fall term went on and on.
I hope every one has a very very happy holiday.
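
For completeness, here is a sketch that runs the same pattern against the full sample from the question. It assumes GNU grep, since \b and back-references inside -E patterns are GNU extensions rather than POSIX features. The temporary file and the variable names sample and matches are only there for illustration.

```shell
# Sample input from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
The sky was grey.
The fall term went on and on.
I hope every one has a very very happy holiday.
My heart is blue.
I like you too too too much
I love daisies.
EOF

# (\b[a-z]+\b) captures one whole lowercase word; \b\1\b requires the same
# whole word to appear again later on the line.
matches=$(grep -E '(\b[a-z]+\b).*\b\1\b' "$sample")
printf '%s\n' "$matches"
```

The \b anchors matter: without them, a single letter shared by two different words (such as the "e" in "heart" and "blue") would already count as a repeat.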

Upvotes: 2