Reputation: 151
I have a rather interesting problem that I'm not sure how to approach. My file looks something like this:
GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,
Each line contains a group which contains items, e.g. GROUP1 contains 1 Tall.hat, 1 Bow.tie and 1 Shiny.shoe. Columns are separated by commas. I want to delete lines (or GROUPs) that only contain 1 of each item.
Desired output:
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,
So GROUP1 has been deleted because it only contains 1 of each item. All other groups have at least one item with two copies or more.
Thoughts so far:
I need to ignore (but retain) column 1, since that contains the group number. So start off with something like awk -F "," 'NF>1'. Then for each row, cycle through all the columns and record all the possible numbers found, e.g. GROUP1=1; GROUP2=1350 or 1; GROUP30=2 or 4; GROUP170=1 or 2. If the only unique number found is 1, then delete that line.
I'm not sure how to actually implement this, though. Any ideas would be great!
Upvotes: 0
Views: 41
Reputation: 204015
$ grep -E ' (1[0-9]|[2-9])' file
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,
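The pattern relies on every count being preceded by a space: it matches a space followed either by 1 and another digit (10 and up, which includes 1350) or by a single digit from 2 to 9, so a lone " 1" never matches. A quick check of that reading, as a sketch on three lines taken from the question (the variable names sample and matched are only for illustration):

```shell
# Three lines from the question: one that should be dropped, two that should stay.
sample='GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,'

# Keep only lines where some count is 2 or more.
matched=$(printf '%s\n' "$sample" | grep -E ' (1[0-9]|[2-9])')

printf '%s\n' "$matched"
```

This assumes, as in the sample data, that item names never start with a digit; a name such as "2nd.hat" after a space would also match.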
Upvotes: 0
Reputation: 74685
Here's a solution using awk:
awk -F', *' '{
    split("", counts)                         # empty the counts array at the start of each line
    for (i = 2; i <= NF; ++i) {               # loop through fields, starting from 2nd
        split($i, a, /[. ]/)                  # split each field into parts
        counts[a[3]] += a[1]                  # accumulate count for each type
        if (counts[a[3]] > 1) { print; next } # print and skip to next line
    }
}' file
-F', *' sets the field separator to a comma followed by any number of spaces. This makes things a bit easier, since the extra spaces are consumed and don't form part of the fields $2, $3 and so on later on.
counts will contain keys like "apple", "pencil", "pen", etc. For each key, the value is the total number of occurrences of that type on the current line.
If you want to keep separate counts for "Blue.pen" and "Green.pen", just split on a single space with split($i, a, / /) rather than on spaces and dots. Each field will then be split into only two parts, so replace a[3] with a[2] in the subsequent lines.
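A minimal sketch of that full-name variant. The input is inlined for convenience: GROUP1 and GROUP30 come from the question, while GROUP8 is a hypothetical line added here only to show where the two keyings differ. The variable names input and kept, and the explicit skip of the empty trailing field, are also additions for illustration.

```shell
# GROUP8 is a hypothetical line, not part of the original data.
input='GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP8, 1 Blue.pen, 1 Green.pen,'

# Key the counts on the full item name, so Blue.pen and Green.pen stay separate.
kept=$(printf '%s\n' "$input" | awk -F', *' '{
    split("", counts)                 # empty the counts array for each line
    for (i = 2; i <= NF; ++i) {
        if ($i == "") continue        # skip the empty field after the trailing comma
        split($i, a, / /)             # a[1] = number, a[2] = full item name
        counts[a[2]] += a[1]
        if (counts[a[2]] > 1) { print; next }
    }
}')

printf '%s\n' "$kept"
```

With this keying only GROUP30 is kept. Keyed on the type instead (a[3] after splitting on dots and spaces), GROUP8 would also be kept, because its two pens add up to 2.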
Splitting an empty string to clear the counts array is a workaround for non-GNU versions of awk; in GNU awk it can be replaced by delete counts.
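For comparison, a sketch of the same type-keyed loop with the array cleared by delete counts instead of the split trick. This assumes an awk that supports deleting a whole array, such as GNU awk. GROUP8 is a hypothetical line, not from the question, included to show that two single pens of different colours add up to 2 under this keying.

```shell
# GROUP1 is from the question; GROUP8 is a hypothetical extra line.
input='GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP8, 1 Blue.pen, 1 Green.pen,'

kept_by_type=$(printf '%s\n' "$input" | awk -F', *' '{
    delete counts                          # clear the whole array for each line
    for (i = 2; i <= NF; ++i) {
        split($i, a, /[. ]/)               # a[1] = number, a[3] = item type
        counts[a[3]] += a[1]               # accumulate per type, e.g. pen
        if (counts[a[3]] > 1) { print; next }
    }
}')

printf '%s\n' "$kept_by_type"
```

Here GROUP1 is dropped and GROUP8 is kept.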
Upvotes: 1
Reputation: 92874
awk solution:
awk -F", *" '{split($2,a," "); n=a[1]; for(i=3;i<NF;i++){ split($i,a," ");
if (a[1]!=n) { print; next} else {n=a[1]}} if(n!=1){ print } }' file
The output:
GROUP2, 1350 Red.apple, 1 Black.pencil, 1 Blue.pen, 1 Green.pen, 1 Little.tree,
GROUP30, 2 Green.bow, 4 Big.tree,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP7, 20 Red.apple, 20 Black.pencil, 20 Blue.pen, 20 Green.pen, 20 Little.tree,
Details:
split($2,a," ") - split the first item on the space, so a[1] is filled with the number and a[2] contains the item name
n=a[1] - capture the number of the first item in each group
for(i=3;i<NF;i++) - iterate through the remaining items (i<NF stops before the empty last field left by the trailing comma)
if (a[1]!=n) { print; next} - as soon as an item's number differs from the first item's number, print the line and jump to the next line
if(n!=1){ print } - if all item numbers within the line are equal and that value is not 1, print the line; otherwise the line won't be printed
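The same logic laid out over several lines, as a sketch run against three lines from the question (the here-string input and the variable names input and result are additions for illustration; the redundant else branch of the one-liner is left out since n never changes):

```shell
input='GROUP1, 1 Tall.hat, 1 Bow.tie, 1 Shiny.shoe,
GROUP6, 2 Tall.hat, 2 Bow.tie, 2 Shiny.shoe,
GROUP170, 1 Yellow.banana, 2 Green.apple, 1 Blue.skirt, 1 Purple.top, 1 Silver.shoe,'

result=$(printf '%s\n' "$input" | awk -F", *" '{
    split($2, a, " "); n = a[1]          # number of the first item
    for (i = 3; i < NF; i++) {           # i < NF skips the empty field after the trailing comma
        split($i, a, " ")
        if (a[1] != n) { print; next }   # numbers differ, so at least one is not 1
    }
    if (n != 1) print                    # all numbers equal: keep unless they are all 1
}')

printf '%s\n' "$result"
```

GROUP1 is dropped, GROUP6 is kept by the final check, and GROUP170 is kept as soon as the 2 is seen. Note that the loop bound depends on every line ending with a comma; without it, the last item would not be examined.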
Upvotes: 0