Reputation: 33
I am trying to call awk from a bash script to remove duplicate data entries of a field in a file.
Data Example in file1
data1 a,b,c,d,d,d,c,e
data2 a,b,b,c
Desired Output:
data1 a,b,c,d,e
data2 a,b,c
First, I removed the first column so that only the second remains.
cut --complement -d$'\t' -f1 file1 &> file2
This worked fine, and now I just have the following in file2:
a,b,c,d,d,d,c,e
a,b,b,c
So then I tried this code that I found but do not understand well:
awk '{
for(i=1; i<=NF; i++)
printf "%s", (!seen[$1]++? (i==1?"":FS) $i: "" )
delete seen; print ""
}' file2
The problem is that this code was written for a space delimiter, while my data is now comma-delimited with a variable number of values on each row. The code just prints the file as is; I can see no difference. I also tried to make FS a comma, to no avail:
printf "%s", (!seen[$1]++? (i==1?"":FS=",") $i: ""
Upvotes: 1
Views: 261
Reputation: 2815
so i did something similar lately - sanitizing the output of the gnu prime-factoring program when it prints out every single copy of a bunch of small primes :
gawk -Mbe '
BEGIN {
__+=__+=__+=(__+=___=_+=__=____=_^=_<_)-+-++_
__+=__^=!(___=__-=_+=_++)
for (_; _<=___; _+=__) {
if ((_%++__)*(_%(__+--__))) {
print ____*=_^_
}
}
} | gfactor | sanitize_gnu_factor
58870952193946852435332666506835273111444209706677713:
7^7
11^11
13^13
17^17
116471448967943114621777995869564336419122830800496825559417754612566153180027:
7^7
11^11
13^13
17^17
19^19
2431978363071055324951111475877083878108827552605151765803537946846931963403343871776360412541253748541645309:
7^7
11^11
13^13
17^17
19^19
23^23
6244557167645217304114386952069758950402417741892127946837837979333340639740318438767128131418285303492993082345658543853142417309747238004933649896921:
7^7
11^11
13^13
17^17
19^19
23^23
29^29
823543:
7^7
234966429149994773:
7^7
11^11
71165482274405729335192792293569:
7^7
11^11
13^13
And the core sanitizer does basically the same thing - intra-row duplicate removal :
sanitize_gnu_factor() # i implemented it as a shell function
{
mawk -Wi -- '
BEGIN {
______ = "[ ]+"
___= _+= _^=__*=____ = FS
_______ = FS = "[ \v"(OFS = "\f\r\t")"]+"
FS = ____
} {
if (/ is prime$/) {
print; next
} else if (___==NF) {
$NF = " - - - - - - - \140\140\140"\
"PRIME\140\140\140 - - - - - - - "
} else {
split("",_____)
_ = NF
do { _____[$_]++ } while(--_<(_*_))
delete _____[""]
sub("$"," ")
_^=_<_
for (__ in _____) {
if (+_<+(___=_____[__])) {
sub(" "(__)"( "(__)")+ ",
sprintf(" %\47.f^%\47.f ",__,___))
} }
___ = _+=_^=__*=_<_
FS = _______
$__ = $__
FS = ____ } } NF = NF' |
mawk -Wi -- '
/ is prime$/ { print
next } /[=]/ { gsub("="," ")
} $(_^=(_<_)) = \
(___=length(__=$_))<(_+=_++)^(_+--_) \
?__: sprintf("%.*s......%s } %\47.f dgts ",
_^=++_,__, substr(__,++___-_),--___)' FS='[:]' OFS=':'
}
Upvotes: 0
Reputation: 11207
Using GNU sed, if applicable
$ sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta' input_file
data1 a,b,c,d,e
data2 a,b,c
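For anyone puzzling over the regex, here is one way to read it (this relies on GNU sed for -E and the \< \> word boundaries), run against the question's sample lines:

```shell
# :a labels the substitution so t (branch-if-substituted) can loop back to it.
# Each pass, \<[^,]*\> captures one comma-free word (\2), .* skips ahead to a
# later ",word" repeat of it, and the replacement \1 drops that repeat.
# The loop stops once a pass makes no substitution.
printf 'data1 a,b,c,d,d,d,c,e\ndata2 a,b,b,c\n' |
  sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta'
# prints:
# data1 a,b,c,d,e
# data2 a,b,c
```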
Upvotes: 0
Reputation: 16652
This is similar to the code you found.
awk -F'[ ,]' '
{
s = $1 " " $2
seen[$2]++
for (i=3; i<=NF; i++)
if (!seen[$i]++) s = s "," $i
print s
delete seen
}
' data-file
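To see it end to end, a short run with the question's sample written to data-file (the placeholder file name used above):

```shell
# Recreate the question's sample input, then run the script above.
printf 'data1 a,b,c,d,d,d,c,e\ndata2 a,b,b,c\n' > data-file
awk -F'[ ,]' '
{
    s = $1 " " $2          # keep the label and the first value
    seen[$2]++             # mark the first value as seen
    for (i=3; i<=NF; i++)
        if (!seen[$i]++)   # true only on first sight of $i
            s = s "," $i
    print s
    delete seen            # reset the array for the next line
}
' data-file
# prints:
# data1 a,b,c,d,e
# data2 a,b,c
```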
-F'[ ,]' - split input lines on spaces and commas
s = ... - we could use printf like the code you found, but building a string is less typing
!seen[x]++ is a common idiom - it returns true only the first time x is seen
$2 - add $2 to the print string and set seen[$2]
delete seen - clear the array for the next line
Upvotes: 1
Reputation: 343
That code is right; you just need to specify the delimiter and change $1 to $i.
$ awk -F ',' '{
for(i=1; i<=NF; i++)
printf "%s", (!seen[$i]++? (i==1?"":FS) $i: "" )
delete seen; print ""
}' /tmp/file1
data1 a,b,c,d,e
data2 a,b,c
Upvotes: 1
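This also explains the "prints the file as is" symptom: without -F',' awk splits on whitespace, so a comma-joined line in file2 is one single field, NF is 1, the loop body runs once, and the whole line is reprinted unchanged. With the comma delimiter and seen[$i] the dedup works, shown here on the file2 contents from the question:

```shell
# Without -F',' a line like "a,b,c,d,d,d,c,e" is one field (NF==1), so the
# loop reprints it untouched. With -F',' each value is its own field and
# !seen[$i]++ suppresses every repeat after the first.
printf 'a,b,c,d,d,d,c,e\na,b,b,c\n' |
  awk -F',' '{
    for (i=1; i<=NF; i++)
      printf "%s", (!seen[$i]++ ? (i==1 ? "" : FS) $i : "")
    delete seen; print ""
  }'
# prints:
# a,b,c,d,e
# a,b,c
```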