Vicky

Reputation: 1328

Printing lines with duplicate words

I am trying to print all lines that contain the same word twice or more.

E.g. with this input file:

cat dog cat
dog cat deer
apple peanut banana  apple
car bus train plane
car train car train

Output should be:

cat dog cat
apple peanut banana  apple
car train car train

I have tried this code and it works but I think there must be a shorter way.

awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'

Later I want to find all such duplicate words and delete every duplicate entry except the first occurrence.

So input:

cat dog cat lion cat 
dog cat deer
apple peanut banana  apple
car bus train plane
car train car train

Desired output:

cat dog lion 
dog cat deer
apple peanut banana  
car bus train plane
car train

Upvotes: 2

Views: 2885

Answers (4)

Ruslan Osmanov

Reputation: 21492

I'll show solutions in Perl as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.

Detecting Duplicates

perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file

where

  • -n causes Perl to execute the expression passed via -e for each input line;
  • \b matches word boundaries;
  • \S+ matches one or more non-space characters;
  • .*? is a non-greedy match for zero or more characters;
  • \1 is a backreference to the first group, i.e. the word \S+;
  • g matches the pattern repeatedly in the string; for a simple yes/no test like this one it is not required.

Removing Duplicates

perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file

where

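  • -p makes Perl loop over the input like -n and print each line ($_) after the code has run, like sed;
  • the 1 while (...) loop repeats the substitution for as long as it replaces something;
  • \K keeps everything matched before it out of the replacement, so only the part after \K is replaced.

Each duplicate word, together with the whitespace character before it (\s\1\b), is replaced with the empty string (//g).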

Why Perl?

Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. You can also use the /x modifier, which lets you write regular expressions in a more readable format and even put Perl comments in them, e.g.:

perl -pe '1 while (
  s/            # Begins substitution: s/pattern/replacement/flags
  \b (\S+) \b   # A word
  .*?           # Ungreedy pattern for any number of characters
  \K            # Keep everything that matched the previous patterns
  (             # Group for the duplicate word:
    \s          #   - space
    \1          #   - backreference to the word
    \b          #   - word boundary
  )
  //xg
)' file

As you may have noticed, \K is very convenient, but it is not available in many popular tools, including awk, bash, and sed.

Upvotes: 0

Bill Karwin

Reputation: 562388

Here's a solution for printing only lines that contain duplicate words.

awk '{
  delete seen
  for (i=1;i<=NF;++i) {
    if (seen[$i]) { print ; next }
    seen[$i] = 1 
  }
}'
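
As a quick check, the script can be wrapped in a shell function and fed the sample input from the question. This is only a sketch: the function name print_dup_lines is made up here, and the awk body is the detection script above.

```shell
# Hypothetical wrapper name; the awk program is the detection script above.
print_dup_lines() {
  awk '{
    delete seen                      # forget words from the previous line
    for (i = 1; i <= NF; ++i) {
      if (seen[$i]) { print; next }  # word already seen on this line
      seen[$i] = 1
    }
  }' "$@"
}

# Sample input from the question.
print_dup_lines <<'EOF'
cat dog cat
dog cat deer
apple peanut banana  apple
car bus train plane
car train car train
EOF
```

On the sample this should print the three lines that repeat a word. Their spacing is untouched, because print emits the unmodified record.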

Here's a solution for deleting duplicate words after the first.

awk '{
  delete seen
  for (i=1;i<=NF;++i) {
    if (seen[$i]) { continue }
    printf("%s ", $i);
    seen[$i] = 1 
  }
  print "";
}'
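
Note that the script above leaves a trailing space on every line, since each word is printed with "%s ". If that matters, one possible variant builds the line first and prints it once. This is a sketch, not the only way to do it, and the name dedupe_words is invented for the example.

```shell
# Hypothetical function name; a variant of the removal script above.
dedupe_words() {
  awk '{
    delete seen
    out = ""; sep = ""
    for (i = 1; i <= NF; ++i) {
      if ($i in seen) continue   # skip repeats of a word on this line
      seen[$i] = 1
      out = out sep $i           # separator is empty before the first word
      sep = " "
    }
    print out
  }' "$@"
}

printf 'cat dog cat lion cat \ncar train car train\n' | dedupe_words
```

Like the original, this collapses any run of whitespace between words into a single space, so it does not preserve the double space in the third sample line.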

Re your comment...

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997

Upvotes: 2

Lars Fischer

Reputation: 10149

You can use this GNU sed command:

sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
  • -r activates extended regular expressions, and -n suppresses the implicit printing of every line
  • the p command then prints only lines that match the preceding regular expression (inside the slashes):
    • \b\w+\b matches a word: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions
    • such a word is "stored" in \1 for later reuse, thanks to the parentheses
    • then we try to match this word again with \b\1\b, allowing anything (.*), including nothing, between the two places
    • and that is the whole trick: match something, put it in parentheses so you can reuse it in the same regular expression with \1
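
To see what the word boundaries buy you, compare the pattern with and without \b on a line where one word merely contains another. This is a small sketch that assumes GNU sed, as above; the two sample lines are invented for the demonstration.

```shell
# Assumes GNU sed (\b and \w are GNU extensions); sample lines are made up.
sample='cat concatenate
cat dog cat'

# With \b: only whole-word repeats match.
printf '%s\n' "$sample" | sed -rn '/(\b\w+\b).*\b\1\b/ p'

# Without \b: "cat" inside "concatenate" counts as a repeat too.
printf '%s\n' "$sample" | sed -rn '/(\w+).*\1/ p'
```

The first command should print only "cat dog cat", while the second should print both lines.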

To answer the second part of the question (deleting the doubled words after the first while still printing every line, modifying only the lines with doubled words), you could use some sed s magic:

sed -r ':A s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g; t A ;'
  • here we use the backreference trick again.
  • but we have to account for the text before, between, and after our doubled words, so we have a \2 in the matching part of the s command and the other backreferences in the replacement part.
  • notice that the repeated word (\b\2\b) is the only piece of the matching part that is not captured in parentheses, while the replacement uses all four groups, so the second word of the pair is effectively deleted.
  • for more repetitions of the word we need a loop:
    • :A is a label
    • t A jumps to the label if the last s command made a replacement
    • this builds a "while loop" around the s to delete the other repetitions, too
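
Because the substitution removes only the word itself, the spaces around it stay behind. If you want tidy output, a cleanup step can follow the loop. This is a minimal sketch assuming GNU sed. The function name sed_dedupe and the two cleanup substitutions are additions for illustration and are not part of the command above.

```shell
# Hypothetical helper; assumes GNU sed (-r, \b and \w are GNU features).
sed_dedupe() {
  # 1) loop: delete the later word of a repeated pair until nothing changes
  # 2) squeeze runs of spaces left behind, then drop one trailing space
  sed -r -e ':A' \
         -e 's/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g' \
         -e 't A' \
         -e 's/ +/ /g' \
         -e 's/ $//'
}

printf 'cat dog cat lion cat \ncar train car train\n' | sed_dedupe
```

The cleanup handles spaces only (not tabs), and it also squeezes double spaces on lines that had no duplicates, so skip it if the original spacing has to be preserved.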

Upvotes: 3

hek2mgl

Reputation: 158030

With egrep you can use a so-called back-reference:

egrep '(\b\w+\b).*\b\1\b' file

(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.
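
Recent GNU grep releases mark egrep as obsolescent in favor of grep -E, which takes the same pattern. A small sketch on the sample input from the question, assuming GNU grep (\b, \w, and back-references in extended regular expressions are GNU features, not something every grep supports):

```shell
# Assumes GNU grep; same pattern as above, spelled with grep -E.
grep -E '(\b\w+\b).*\b\1\b' <<'EOF'
cat dog cat
dog cat deer
apple peanut banana  apple
car bus train plane
car train car train
EOF
```

This should print the first, third, and fifth lines unchanged.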

Upvotes: 1
