rohitkulky
rohitkulky

Reputation: 1232

Remove rows in a csv which has a specific entry in one column and entries in other columns are repeated

I have stumbled upon this problem, which I solved, by hook or by crook. But I need your help in a precise solution. Being a beginner in awk/sed, I could not solve it with a one liner (which I am sure there is), or an awk script, though with a lot of pipes I could.

Here is the question:

I have a large .csv file with entries similar to this:

file (space delimited)

$ cat file
a d e r None
c f g r a
c f g r None
d a d e c
d a d e None
g f r t None
g f t r None
k f r e d
r e t y None
s c d er d
g f r t 4

There are no duplicates. But if you see carefully, there are entries in columns 1,2,3,4 that are repeating, and the only change is 5th column, with 'None'. So I need to remove the rows (records) which are repeating in the 1,2,3,4 fields and that have None in 5th column.

Here is the code i wrote, which worked, but no one would recommend:

awk '{print $5,$4,$3,$2,$1}' file | sed 's/None/zzz/g' | sort | awk '!array[$2,$3,$4,$5]++' | sed 's/zzz/None/g'

and here is the output that I got, and am expecting.

4 t r f g
a r g f c
c e d a d
d e r f k
d er d c s
None r e d a
None r t f g
None y t e r

Purpose of replacing None with zzz was after sorting, the rows will appear in the end, and awk will remove the second occurrences of the duplicates from remaining columns. Same is the reason for inverting column sequence and re-inverting it back. Sort.

Please if you could help. Thanks!

Upvotes: 1

Views: 753

Answers (1)

Kent
Kent

Reputation: 195229

I got this solution:

awk '{s=$4" "$3" "$2" "$1; if($5=="None"&& s in a)next;else a[s]=$5" "s}END{for(i in a)print a[i]}' file|sort

it outputs:

kent$  awk '{s=$4" "$3" "$2" "$1; if($5=="None"&& s in a)next;else a[s]=$5" "s}END{for(i in a)print a[i]}' file|sort
4 t r f g
a r g f c
c e d a d
d er d c s
d e r f k
None r e d a
None r t f g
None y t e r

it seems that it is same as your expectation.

Upvotes: 3

Related Questions