Reputation: 11

How to exclude lines with duplicate strings using grep

I have a text file with the following contents.

good,bad,ugly
good,good,ugly
good,good,good,bad,ugly
good,bad,bad
bad,bad,bad,bad,good
bad,ugly,good
bad,good,bad
good,good,good,good,bad
ugly,bad,good
bad,bad,bad,good,ugly

I only want to list lines that contain exactly one occurrence each of ugly and bad. Any line with multiple bads needs to be excluded.

good,bad,ugly
good,good,good,bad,ugly
bad,ugly,good
ugly,bad,good

I've tried to use the following, but it still lists lines with multiple bads.

grep -E "bad|ugly" file.txt | grep -v "\('bad'\).*\1"

Upvotes: -1

Answers (3)

Ed Morton

Reputation: 204416

grep isn't the best choice for data that contains fields or whenever your requirements have multiple conditions or arithmetic to test. Using any awk:

$ awk -F, '
    { delete cnt; for (i=1; i<=NF; i++) cnt[$i]++ }
    (cnt["ugly"] == 1) && (cnt["bad"] == 1)
' file
good,bad,ugly
good,good,good,bad,ugly
bad,ugly,good
ugly,bad,good

Unlike the grep solutions posted so far, the above would do the [presumably] right thing if your input contained other similar strings like badlands or your target strings contained regexp metachars like b.*.

Also imagine how trivial it'd be to update that vs updating a grep command to work with counts of any additional strings and/or different counts of bad and ugly.
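
To illustrate that last point, here is a minimal sketch that moves the required counts into a variable, so adding words or changing counts needs no change to the script itself. The want variable and its word=count format are assumptions made for this example, not part of the answer above. split("", cnt) is used to clear the array because it works in every awk.

```shell
# Sample input copied from the question.
sample='good,bad,ugly
good,good,ugly
good,good,good,bad,ugly
good,bad,bad
bad,bad,bad,bad,good
bad,ugly,good
bad,good,bad
good,good,good,good,bad
ugly,bad,good
bad,bad,bad,good,ugly'

# want is a hypothetical space-separated list of word=count pairs.
result=$(printf '%s\n' "$sample" | awk -F, -v want='ugly=1 bad=1' '
    BEGIN {
        n = split(want, pairs, " ")
        for (i = 1; i <= n; i++) {
            split(pairs[i], kv, "=")
            need[kv[1]] = kv[2] + 0
        }
    }
    {
        split("", cnt)                       # clear the per-line counts
        for (i = 1; i <= NF; i++) cnt[$i]++  # count each comma-separated field
        for (w in need)
            if (cnt[w] + 0 != need[w]) next  # skip the line on any mismatch
        print
    }
')
printf '%s\n' "$result"
```

Changing want to 'ugly=1 bad=2' would instead keep only lines with exactly one ugly and exactly two bads.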

Upvotes: 2

Philippe

Reputation: 26727

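Your pattern never matches because the single quotes in \('bad'\) are taken literally. Remove them; with -P (Perl-compatible regular expressions, a GNU grep extension) the group is written with plain parentheses: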

grep -E "bad|ugly" file.txt | grep -Pv "(bad).*\1"
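
For comparison, here is a sketch of the same back-reference in grep's default basic syntax, where a group is written \(...\). The only change from the question's pattern is dropping the inner single quotes, which grep would otherwise match literally. The sample variable just holds the question's input.

```shell
# Sample input copied from the question.
sample='good,bad,ugly
good,good,ugly
good,good,good,bad,ugly
good,bad,bad
bad,bad,bad,bad,good
bad,ugly,good
bad,good,bad
good,good,good,good,bad
ugly,bad,good
bad,bad,bad,good,ugly'

# \(bad\) captures the text bad; \1 requires the same text again later on the line.
result=$(printf '%s\n' "$sample" | grep -E 'bad|ugly' | grep -v '\(bad\).*\1')
printf '%s\n' "$result"
```

Like the -P version, this still lets through lines that contain only one of the two words (for example good,good,ugly), because the first grep accepts either word.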

Upvotes: 0

Fourat Ben Driaa

Reputation: 1246

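grep -E "bad|ugly" matches any line containing either "bad" or "ugly", so lines with only one of the two words get through. The back-reference never matches because the single quotes inside \('bad'\) are part of the pattern and are matched literally. Instead, require both words in either order, then drop any line where bad appears twice: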

grep -E 'bad.*ugly|ugly.*bad' file.txt | grep -v 'bad.*bad'

This will give you:

good,bad,ugly
good,good,good,bad,ugly
bad,ugly,good
ugly,bad,good
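
If the data might contain words such as badlands, or lines with ugly repeated, one possible tightening is to anchor each word to the comma-separated field boundaries. This is only a sketch, assuming fields never contain commas; the last two input lines are made up to exercise those cases.

```shell
# Question's input plus two hypothetical lines (the last two).
sample='good,bad,ugly
good,good,ugly
good,good,good,bad,ugly
good,bad,bad
bad,bad,bad,bad,good
bad,ugly,good
bad,good,bad
good,good,good,good,bad
ugly,bad,good
bad,bad,bad,good,ugly
badlands,bad,ugly
bad,ugly,ugly'

# 1. keep lines that have bad as a whole field
# 2. keep lines that have ugly as a whole field
# 3. drop lines with two bad fields
# 4. drop lines with two ugly fields
result=$(printf '%s\n' "$sample" |
    grep -E  '(^|,)bad(,|$)' |
    grep -E  '(^|,)ugly(,|$)' |
    grep -Ev '(^|,)bad,(.*,)?bad(,|$)' |
    grep -Ev '(^|,)ugly,(.*,)?ugly(,|$)')
printf '%s\n' "$result"
```

With this input, badlands,bad,ugly is kept because badlands is not counted as bad, and bad,ugly,ugly is dropped because ugly appears twice.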

Upvotes: 3
