Alex B

Reputation: 13

Remove lines from a file with awk when they contain a duplicated pattern

For example, each line of the file looks something like this:

WordA WordB WordC:WordD WordE WordF
WordA WordB WordC:WordD WordE WordF
WordA WordB WordC:WordD WordE WordF

I care only about the WordC:WordD pattern. When it is duplicated, I want to keep the first line that contains it and discard the later lines that repeat it.

For example, if I initially have lines like this:

cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf

I want to get this result:

cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog

because the 3rd line's WordC:WordD pattern (bear:tiger) duplicates the one on the 1st line.

I have tried this, but it prints all lines, including the duplicates:

awk '$0 ~ /.*[a-zA-Z]+:[a-zA-Z]+.*/ {if (!seen[$0]) {print; seen[$0] = 1}}' file.txt

Upvotes: 1

Views: 70

Answers (3)

Daweo

Reputation: 36848

I would use GNU AWK for this task in the following way. Let the content of file.txt be

cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf

then

awk 'BEGIN{FPAT="[[:alpha:]]*:[[:alpha:]]*"}!arr[$1]++{print $1}' file.txt

gives output

bear:tiger
bear:lion

Explanation: FPAT tells GNU AWK that a field is zero or more (*) alphabetic characters, followed by a colon (:), followed by zero or more (*) alphabetic characters. In this example each line therefore has exactly one field, with the values bear:tiger, bear:lion and bear:tiger. I then use the usual AWK idiom for keeping only unique values (the idiom itself is not specific to GNU AWK), which for unique lines looks as follows

awk '!arr[$0]++' file.txt

In this case we key on the 1st field ($1) rather than the whole line ($0), and print just that field rather than the whole line (printing the whole line is the default action, applied when no action is given explicitly, as in the unique-lines command above).
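The unique-values idiom can be written out longhand to show what it does. This is only a sketch in plain POSIX awk: the function name dedup_lines and the three input lines are made up for illustration, and it keys on the whole line ($0) as in the unique-lines command.

```shell
# Hypothetical helper: longhand equivalent of  awk '!arr[$0]++'
dedup_lines() {
  awk '{
    if (arr[$0] == 0) {   # unset array entries compare equal to 0
      print               # so this prints only the first occurrence
    }
    arr[$0]++             # count the line so later copies are skipped
  }'
}

# Made-up input: the third line repeats the first.
result=$(printf '%s\n' 'a b' 'c d' 'a b' | dedup_lines)
printf '%s\n' "$result"
```

Replacing $0 with another expression, such as $1 under the FPAT setting, changes what counts as a duplicate without changing the logic.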

Disclaimer: this solution always outputs the first of the duplicates.

(tested in GNU Awk 5.0.1)

Upvotes: 1

The fourth bird

Reputation: 163632

One option is to use the regex match, rather than the whole line, as the key for seen:

(updated to shorter notations after feedback from @Ed Morton in the comments)

awk 'match($0, /[a-zA-Z]+:[a-zA-Z]+/) && !seen[substr($0, RSTART, RLENGTH)]++' file

Output

cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog

Or, using GNU awk, where the third argument to match() stores the matched text in a[0]:

awk 'match($0, /[a-zA-Z]+:[a-zA-Z]+/, a) && !seen[a[0]]++' file.txt
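For readers who find the POSIX one-liner dense, it can be unpacked into separate steps. This is a sketch rather than a different method: the names sample, result and key are my own, and the input is the three lines from the question.

```shell
# Sample input from the question.
sample='cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog
cat mouse bear:tiger frog wolf'

result=$(printf '%s\n' "$sample" | awk '
  # match() returns 0 when the line has no word:word string, so such lines are skipped.
  match($0, /[a-zA-Z]+:[a-zA-Z]+/) {
    key = substr($0, RSTART, RLENGTH)   # the matched text, e.g. "bear:tiger"
    if (!seen[key]++)                   # true only the first time this key occurs
      print                             # print the whole line, not just the key
  }')
printf '%s\n' "$result"
```

match() sets RSTART and RLENGTH to the position and length of the leftmost match, so only the first word:word string on each line is used as the key.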

Upvotes: 2

Ed Morton

Reputation: 204676

$ awk '!seen[$3]++' file
cat dog bear:tiger mouse elephant
cat dog bear:lion wolf frog

If that's not all you need, then edit your question to provide more truly representative sample input/output, e.g. where the Word:Word string isn't always in the 3rd field and/or can occur multiple times on a line.
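As a rough sketch of the first of those cases, here is one way to key on a Word:Word field wherever it appears. The input is invented for illustration, and the sketch makes two assumptions the question does not confirm: only the first field that is entirely Word:Word counts, and lines with no such field are dropped.

```shell
# Invented input: the Word:Word field moves around, and the last line has none.
sample='cat dog bear:tiger mouse elephant
lion:wolf cat dog frog
cat mouse frog bear:tiger wolf
dog cat mouse'

result=$(printf '%s\n' "$sample" | awk '
{
  key = ""
  for (i = 1; i <= NF; i++) {
    # Assumption: the key is the first field that is entirely word:word.
    if ($i ~ /^[a-zA-Z]+:[a-zA-Z]+$/) { key = $i; break }
  }
  # Assumption: lines with no such field are dropped rather than kept.
  if (key != "" && !seen[key]++) print
}')
printf '%s\n' "$result"
```

With this input the first two lines are printed; the third repeats bear:tiger and the fourth has no Word:Word field.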

Upvotes: 1
