Reputation: 1328
I am trying to print all lines that contain the same word twice or more.
E.g. with this input file:
cat dog cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Output should be
cat dog cat
apple peanut banana apple
car train car train
I have tried this code and it works but I think there must be a shorter way.
awk '{ a=0;for(i=1;i<=NF;i++){for(j=i+1;j<=NF;j++){if($i==$j)a=1} } if( a==1 ) print $0}'
Later I want to find all such duplicate words and delete every duplicate entry except for the first occurrence.
So input:
cat dog cat lion cat
dog cat deer
apple peanut banana apple
car bus train plane
car train car train
Desired output:
cat dog lion
dog cat deer
apple peanut banana
car bus train plane
car train
Upvotes: 2
Views: 2885
Reputation: 21492
I'll show solutions in Perl, as it is probably the most flexible tool for text parsing, especially when it comes to regular expressions.
To print only the lines that contain a duplicate word:
perl -ne 'print if m{\b(\S+)\b.*?(\b\1\b)}g' file
where
-n causes Perl to execute the expression passed via -e for each input line;
\b matches word boundaries;
\S+ matches one or more non-space characters;
.*? is a non-greedy match for zero or more characters;
\1 is a backreference to the first group, i.e. the word \S+;
g globally matches the pattern repeatedly in the string.
To remove the duplicate words, keeping the first occurrence:
perl -pe '1 while (s/\b(\S+)\b.*?\K(\s\1\b)//g)' file
where
-p causes Perl to print the line ($_), like sed;
the 1 while loop runs as long as the substitution replaces something;
\K keeps the part matching the previous expression;
duplicate words (\s\1\b) are replaced with an empty string (//g).
Perl regular expressions are known to be very flexible, and regular expressions in Perl are actually more than just regular expressions. For example, you can embed Perl code into the substitution using the /e modifier. You can use the /x modifier, which lets you write regular expressions in a more readable format and even use Perl comments in them, e.g.:
perl -pe '1 while (
  s/              # Begin substitution (pattern, replacement, flags)
    \b (\S+) \b   # A word
    .*?           # Non-greedy pattern for any number of characters
    \K            # Keep everything that matched the previous patterns
    (             # Group for the duplicate word:
      \s          # - space
      \1          # - backreference to the word
      \b          # - word boundary
    )
  //xg
)' file
As you should have noticed, the \K anchor is very convenient, but it is not available in many popular tools, including awk, bash, and sed.
Upvotes: 0
Reputation: 562388
Here's a solution for printing only lines that contain duplicate words.
awk '{
  delete seen
  for (i=1;i<=NF;++i) {
    if (seen[$i]) { print ; next }
    seen[$i] = 1
  }
}'
Here's a solution for deleting duplicate words after the first.
awk '{
  delete seen
  for (i=1;i<=NF;++i) {
    if (seen[$i]) { continue }
    printf("%s ", $i);
    seen[$i] = 1
  }
  print "";
}'
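The printf("%s ", $i) call above leaves a trailing space on every output line. If that matters, one possible variant builds the line in a variable and prints it once. This is only a sketch: the function name dedupe_words and the variable out are made up for illustration, and split("", seen) is used as a portable way to clear the array.

```shell
# Hypothetical helper: keep only the first occurrence of each word per line,
# joining the kept words with single spaces and no trailing space.
dedupe_words() {
  awk '{
    split("", seen)                 # clear the array for each new line
    out = ""
    for (i = 1; i <= NF; ++i) {
      if ($i in seen) continue      # skip words already emitted on this line
      seen[$i] = 1
      out = (out == "" ? $i : out " " $i)
    }
    print out
  }' "$@"
}

# Usage on two lines from the question:
printf '%s\n' 'cat dog cat lion cat' 'dog cat deer' | dedupe_words
# prints:
# cat dog lion
# dog cat deer
```

Like the original, this splits on awk's default field separator, so runs of spaces or tabs in the input collapse to single spaces in the output.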
Re your comment...
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski, 1997
Upvotes: 2
Reputation: 10149
You can use this GNU sed command:
sed -rn '/(\b\w+\b).*\b\1\b/ p' yourfile
-r activates extended regular expressions and -n deactivates the implicit printing of every line.
The p command then prints only lines that match the preceding regular expression (inside the slashes):
\b\w+\b matches words: a nonempty sequence of word characters (\w) between word boundaries (\b); these are GNU extensions.
The matched word is stored in \1 for later reuse, due to the use of parentheses.
\b\1\b then matches that word again, with something optional (.*) between those two places.
To answer the second part of the question, deleting the doubled words after the first but printing all lines (modifying only the lines with doubled words), you could use some sed s magic:
sed -r ':A s/(.*)(\b\w+\b)(.*)\b\2\b(.*)/\1\2\3\4/g; t A ;'
The backreference \2 appears in the matching part of the s command, and the other backreferences appear in the replacement part.
The second \2 has no parentheses in the matching part and we use all groups in the replacement, thus we effectively delete the second word of the pair.
:A is a label.
t A jumps to the label if the last s command made a replacement, so the s command is repeated to delete the other repetitions, too.
Upvotes: 3
Reputation: 158030
With egrep you can use a so-called back reference:
egrep '(\b\w+\b).*\b\1\b' file
(\b\w+\b) matches a word at word boundaries in capturing group 1. \1 references that matched word in the pattern.
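Recent GNU grep releases print an obsolescence warning for egrep, so the same pattern can be passed to grep -E instead. A minimal sketch, assuming GNU grep (\b, \w and back-references in -E patterns are extensions there); the wrapper name dup_lines is hypothetical:

```shell
# Hypothetical wrapper: print only lines in which some word occurs twice.
# Assumes GNU grep; other grep implementations may not accept this pattern.
dup_lines() {
  grep -E '(\b\w+\b).*\b\1\b' "$@"
}

# Usage on three lines from the question:
printf '%s\n' 'cat dog cat' 'dog cat deer' 'car train car train' | dup_lines
# prints:
# cat dog cat
# car train car train
```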
Upvotes: 1