Reputation: 151

Delete lines that only contain 1 of each item

I have a rather interesting problem that I'm not sure how to approach. My file looks something like this:

GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe, 
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree, 
GROUP30, 2 Green.bow, 4 Big.tree, 
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe, 
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,

Each line contains a group which contains items, e.g. GROUP1 contains 1 Tall.hat, 1 Bow.tie and 1 Shiny.shoe. Columns are separated by commas. I want to delete lines (or GROUPs) that only contain 1 of each item.

Desired output:

GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree, 
GROUP30, 2 Green.bow, 4 Big.tree, 
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe, 
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,

So GROUP1 has been deleted because it only contains 1 of each item. All other groups have at least one item with two copies or more.

Thoughts so far:

I need to ignore (but retain) column 1, since that contains the group number, so I could start off with something like awk -F "," 'NF>1'. Then, for each row, cycle through all the columns and record every number found, e.g. GROUP1 = 1; GROUP2 = 1350 or 1; GROUP30 = 2 or 4; GROUP170 = 1 or 2. If the only number found is 1, delete that line.

I'm not sure how to actually implement this, though. Any ideas would be great!

Upvotes: 0

Answers (3)

Ed Morton

Reputation: 204015

$ grep -E ' (1[0-9]|[2-9])' file
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,
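
To unpack the pattern: it looks for a space followed by either a 1 and another digit (so 10 and up) or a single digit from 2 to 9, which together cover any count of 2 or more. Below is a minimal sketch of the same command on a few sample lines. The wrapper name keep_multiples is hypothetical, and the approach assumes counts are positive integers without leading zeros and that a digit never directly follows a space inside an item name.

```shell
# Hypothetical helper wrapping the grep from this answer; reads stdin or files.
keep_multiples() {
    # ' 1[0-9]' matches 10-19 and the start of 100, 1350, ...
    # ' [2-9]'  matches 2-9 and the start of 20, 300, ...
    grep -E ' (1[0-9]|[2-9])' "$@"
}

printf '%s\n' \
    'GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe, ' \
    'GROUP30, 2 Green.bow, 4 Big.tree, ' \
    'GROUP2, 1350 Red.apple, 1 Black.pencil, ' |
    keep_multiples
```

Only the GROUP30 and GROUP2 lines come through. The group label itself (e.g. GROUP30) is never matched, because its digits are not preceded by a space.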

Upvotes: 0

Tom Fenech

Reputation: 74685

Here's a solution using awk:

awk -F', *' '{ 
    split("", counts) # empty the counts array at the start of each line
    for (i = 2; i <= NF; ++i) { # loop through fields, starting from 2nd
        split($i, a, /[. ]/) # split each field into parts
        counts[a[3]] += a[1] # accumulate count for each type
        if (counts[a[3]] > 1) { print; next } # print and skip to next line
    }
}' file

-F', *' sets the field separator to a comma followed by any number of spaces. This makes things a bit easier, since the extra spaces are consumed and don't form part of each field $2, $3 later on.

counts will contain keys like "apple", "pencil", "pen", etc. For each key, the value is the total number of occurrences.

If you want to keep separate counts for "Blue.pen" and "Green.pen", then just split on a single space with split($i, a, / /), rather than on spaces and dots. Each field will then only be split into two parts, so replace a[3] with a[2] in the subsequent lines.

Splitting an empty string to clear the counts array is a workaround for non-GNU versions of awk; in GNU awk it can be replaced by delete(counts).
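
The per-item variant described above (separate counts for "Blue.pen" and "Green.pen") could look roughly like the sketch below. The function name filter_per_item is hypothetical, and the explicit skip of empty fields is an addition for clarity; it should not change the result, since the trailing comma only produces an empty last field.

```shell
# Hypothetical wrapper: keep lines where some full item name totals more than 1.
filter_per_item() {
    awk -F', *' '{
        split("", counts)              # reset the per-line counts
        for (i = 2; i <= NF; ++i) {
            if ($i == "") continue     # empty field left by the trailing comma
            split($i, a, / /)          # a[1] = count, a[2] = full item name
            counts[a[2]] += a[1]       # accumulate per full name, e.g. "Blue.pen"
            if (counts[a[2]] > 1) { print; next }
        }
    }' "$@"
}

printf '%s\n' \
    'GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe, ' \
    'GROUP8, 1 Blue.pen, 1 Green.pen, ' \
    'GROUP30, 2 Green.bow, 4 Big.tree, ' |
    filter_per_item
```

Here the made-up GROUP8 line is dropped, because Blue.pen and Green.pen are counted separately, whereas the version that splits on dots as well would add them together as two of "pen" and keep the line.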

Upvotes: 1

RomanPerekhrest

Reputation: 92874

awk solution:

awk -F", *" '{split($2,a," "); n=a[1]; for(i=3;i<NF;i++){ split($i,a," "); 
            if (a[1]!=n) { print; next} else {n=a[1]}} if(n!=1){ print } }' file

The output:

GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree, 
GROUP30, 2 Green.bow, 4 Big.tree, 
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe, 
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,

Details:

split($2,a," ") - split the first item on whitespace, so a[1] holds the number and a[2] the item name
n=a[1] - capture the number of the first item in the group
for(i=3;i<NF;i++) - iterate through the remaining items (the last field is empty because each line ends with a comma, hence i<NF)
if (a[1]!=n) { print; next} - as soon as two consecutive item numbers differ, print the "proper" line and jump to the next line
if(n!=1){ print } - if all item numbers within the line are equal and that value is not 1, print the line; otherwise the line is not printed
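
The two conditions above amount to a single rule: a line is kept as soon as any item number is something other than 1. A simplified sketch of that rule follows; it is a variant, not the original command, and the function name keep_non_single is hypothetical. It also skips empty fields instead of relying on i<NF, so a line without a trailing comma should be handled too.

```shell
# Hypothetical wrapper: keep lines where at least one item number is not 1.
keep_non_single() {
    awk -F', *' '{
        for (i = 2; i <= NF; i++) {
            if ($i == "") continue           # empty field left by a trailing comma
            split($i, a, " ")                # a[1] = number, a[2] = item name
            if (a[1] != 1) { print; next }   # any number other than 1 keeps the line
        }
    }' "$@"
}

printf '%s\n' \
    'GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe, ' \
    'GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe, ' \
    'GROUP9, 1 Tall.hat, 2 Bow.tie' |
    keep_non_single
```

The GROUP9 line is made up to show the case without a trailing comma: both it and GROUP6 are printed, while GROUP1 is dropped.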

Upvotes: 0
