Reputation: 13
For example, each line of a file contains something like this:
WordA WordB WordC:WordD WordE WordF
WordA WordB WordC:WordD WordE WordF
WordA WordB WordC:WordD WordE WordF
I only care about the WordC:WordD pattern. When it is duplicated, I want to keep only the first line with each distinct WordC:WordD value and discard the lines that repeat it.
For example, if I initially have lines like this:
cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf
I want to get this result:
cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
because the third line's WordC:WordD pattern is a duplicate.
I have tried this, but it prints all lines, including the duplicates:
awk '$0 ~ /.*[a-zA-Z]+:[a-zA-Z]+.*/ {if (!seen[$0]) {print; seen[$0] = 1}}' file.txt
Upvotes: 1
Views: 70
Reputation: 36848
I would use GNU AWK for this task in the following way. Let the content of file.txt be
cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf
then
awk 'BEGIN{FPAT="[[:alpha:]]*:[[:alpha:]]*"}!arr[$1]++{print $1}' file.txt
gives output
bear:tiger
bear:lion
Explanation: I inform GNU AWK that a field is zero or more (*) alphabetic characters, followed by a colon (:), followed by zero or more (*) alphabetic characters. In the example there is therefore only one field per line, with the values bear:tiger, bear:lion, and bear:tiger. Then I use the usual AWK idiom for getting unique values, which for unique lines looks as follows:
awk '!arr[$0]++' file.txt
In this case we need to use the value of the 1st field ($1) rather than the whole line ($0) as the key, and print just that field rather than the whole line (printing the whole line is the default action, which is applied when no action is given explicitly, as in the unique-lines one-liner).
Disclaimer: this solution always outputs the first of the duplicates.
(tested in GNU Awk 5.0.1)
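The !arr[key]++ idiom explained above can also be written out longhand. The following is only a sketch, not the answer's own code: it swaps FPAT for the portable match() function so that it also runs on non-GNU awks, and the variable names input, result, and key are made up for the illustration. Like the one-liner above, it prints only the matched pair, not the whole line.

```shell
# Sample input copied from the question.
input='cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf'

result=$(printf '%s\n' "$input" | awk '
{
    # match() sets RSTART and RLENGTH when the pattern is found.
    if (match($0, /[a-zA-Z]+:[a-zA-Z]+/)) {
        key = substr($0, RSTART, RLENGTH)
        # Longhand for !arr[key]++ : act only the first time key is seen.
        if (!(key in arr)) {
            arr[key] = 1
            print key
        }
    }
}')

printf '%s\n' "$result"
```

Replacing "print key" with a plain "print" would emit the whole line instead, which is closer to the output requested in the question.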
Upvotes: 1
Reputation: 163632
One option could be to use the regex match as the key of seen:
(updated to shorter notations after feedback from @Ed Morton in the comments)
awk 'match($0, /[a-zA-Z]+:[a-zA-Z]+/) && !seen[substr($0, RSTART, RLENGTH)]++' file
Output
cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
Or, using GNU awk:
awk 'match($0, /[a-zA-Z]+:[a-zA-Z]+/, a) && !seen[a[0]]++' file.txt
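To see which key each line is deduplicated on, it can help to print the key together with a keep/drop decision. This is a rough debugging sketch built on the portable match()/RSTART/RLENGTH form above; the keep and drop labels and the names input, report, key, and status are assumptions added for illustration only.

```shell
# Sample input copied from the question.
input='cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf'

report=$(printf '%s\n' "$input" | awk '
{
    if (match($0, /[a-zA-Z]+:[a-zA-Z]+/)) {
        # The matched text is the deduplication key.
        key = substr($0, RSTART, RLENGTH)
        # seen[key]++ is 0 (false) on the first sighting, non-zero afterwards.
        status = (seen[key]++ ? "drop" : "keep")
        print status, key
    }
}')

printf '%s\n' "$report"
```

Lines without any Word:Word pair produce no report line, just as they are skipped by the one-liners above.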
Upvotes: 2
Reputation: 204676
$ awk '!seen[$3]++' file
cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
If that's not all you need, then edit your question to provide more truly representative sample input/output, e.g. where the Word:Word string isn't always in the 3rd field and/or can occur multiple times on a line.
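For the case where the Word:Word string is not always in the 3rd field, one possible approach is to scan the fields and key on the first one that looks like a pair. This is a sketch under assumptions: the sample input below is hypothetical (it is not from the question), and treating only the first pair on a line as the key is a guess at the intended behavior, since the question does not say what should happen with several pairs on one line.

```shell
# Hypothetical input: the pair sits in a different field on each line.
input='bear:tiger cat dog
cat dog bear:lion wolf
mouse bear:tiger frog'

result=$(printf '%s\n' "$input" | awk '
{
    for (i = 1; i <= NF; i++) {
        # A field counts as a pair only if the whole field is letters:letters.
        if ($i ~ /^[a-zA-Z]+:[a-zA-Z]+$/) {
            # Print the line only the first time this pair is seen.
            if (!seen[$i]++) print
            break    # assumption: only the first pair on a line matters
        }
    }
}')

printf '%s\n' "$result"
```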
Upvotes: 1