Reputation: 5059
example1.csv
id1, value1
id2, value2
id3, value3
id1, value4
example2.csv
"06e04,0428","405872,8637110"
"06e04,0428","405872,8637111"
"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"
I need to remove rows with duplicate values in column 1, with the output as below.
required output
example1_out.csv
id2, value2
id3, value3
example2_out.csv
"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"
There are solutions for removing duplicate records that retain one of the duplicates, as in this SO question. However, in this case all rows with duplicate values in column 1 need to be excluded from the output.
Upvotes: 1
Views: 3730
Reputation: 786091
This awk can do that in a single command:
awk -F, '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv
id2, value2
id3, value3
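The same command, spread across lines with comments (functionally identical; note that the for (i in arr) loop makes no guarantee about output order):

awk -F, '
{ arr[$1] = $0 }                      # always remember the latest line for this key
seen[$1]++ { delete arr[$1] }         # key seen before: drop it from the survivors
END { for (i in arr) print arr[i] }   # print the keys that occurred exactly once
' file.csv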
For your edited question, use the literal "," between the quoted fields as the separator, so the commas inside the quotes are not treated as delimiters:
awk -F'","' '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv
"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"
Upvotes: 1
Reputation: 46886
Here's a shorter awk option.
awk -F, 'NR==FNR{a[$1]++;next} a[$1]<2' file.csv file.csv
This reads the file twice -- once to populate an array of counters of the first field, and the second time to print lines whose count is less than 2.
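Spelled out with comments (a sketch, identical in behavior to the one-liner above):

awk -F, '
NR==FNR { a[$1]++; next }   # first pass: count occurrences of the first field
a[$1] < 2                   # second pass: the default action prints unique-key lines
' file.csv file.csv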
If you'd prefer to do this in pure shell rather than awk, and your shell is bash, you could get similar functionality with something like this:
$ declare -A a=()
$ while IFS=, read -r f _; do ((a[$f]++)); done < file.csv
$ declare -p a
declare -A a=([id1]="2" [id3]="1" [id2]="1" )
$ while IFS=, read -r f1 f2; do [ "${a[$f1]}" -lt 2 ] && printf '%s,%s\n' "$f1" "$f2"; done < file.csv
id2, value2
id3, value3
Again, this is two steps -- the first to populate an array of counters, the second to step through the file and print appropriate lines.
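Combined into a standalone script (a sketch; it assumes the simple unquoted format of example1.csv, and read -r keeps any backslashes literal):

#!/usr/bin/env bash
# Pass 1: count occurrences of the first field.
declare -A count=()
while IFS=, read -r key _; do
    (( count[$key]++ ))
done < file.csv

# Pass 2: print only the lines whose first field occurred exactly once.
while IFS=, read -r key rest; do
    (( count[$key] < 2 )) && printf '%s,%s\n' "$key" "$rest"
done < file.csv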
Upvotes: 1
Reputation: 45503
You can use awk:
awk -F "," '{
if (length(arr[$1]) == 0){
arr[$1]=$0
order[i++]=$1
}
else{
delete arr[$1]
}
}
END {
for (i = 1; i < length(order); i++) {
print arr[order[i]]
}
}' somecsv.csv
It stores the first line seen for each key in an array and deletes the entry when the key turns up again; the seen array ensures a key appearing three or more times is not re-added. The input order is preserved with an additional order array.
Upvotes: 0
Reputation: 242333
cut -f1 -d, somecsv.csv | sort | uniq -u | grep -Ff- somecsv.csv
The first command extracts the first column from the input. The second command sorts the ids so that uniq -u can list only the ones occurring exactly once. The final grep takes those unique ids as fixed strings (-F) and searches for them in the input file.
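One caveat: grep -F matches the ids anywhere on the line, so an id that also happens to appear in another row's second column would cause a false match. If your data allows that, anchoring each pattern to the start of the line is one workaround (a sketch; it assumes the ids contain no regex metacharacters):

cut -f1 -d, somecsv.csv | sort | uniq -u | sed 's/^/^/' | grep -f- somecsv.csv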
Upvotes: 1