quantumDog

Reputation: 33

How can I use awk to remove duplicate entries in the same field when the data is separated by commas?

I am trying to call awk from a bash script to remove duplicate data entries of a field in a file.

Data Example in file1

data1 a,b,c,d,d,d,c,e

data2 a,b,b,c

Desired Output:

data1 a,b,c,d,e

data2 a,b,c

First, I removed the first column so that only the second remained.

cut --complement -d$'\t' -f1 file1 &> file2

This worked fine, and now I just have the following in file2:

a,b,c,d,d,d,c,e

a,b,b,c

So then I tried this code that I found but do not understand well:

awk '{
    for(i=1; i<=NF; i++)
            printf "%s", (!seen[$1]++? (i==1?"":FS) $i: "" )
    delete seen; print ""
}' file2

The problem is that this code was written for a space delimiter, while my data is now comma-delimited with a variable number of values on each row. The code just prints the file as is; I can see no difference. I also tried to make FS a comma by doing this, to no avail:

printf "%s", (!seen[$1]++? (i==1?"":FS=",") $i: "" 

Upvotes: 1

Views: 261

Answers (4)

RARE Kpop Manifesto

Reputation: 2815

So I did something similar lately - sanitizing the output of the GNU prime-factoring program when it prints out every single copy of a bunch of small primes:

 gawk -Mbe '
 BEGIN {
     __+=__+=__+=(__+=___=_+=__=____=_^=_<_)-+-++_
     __+=__^=!(___=__-=_+=_++)
     for (_; _<=___; _+=__) {
         if ((_%++__)*(_%(__+--__))) {
             print ____*=_^_
         }
      }
  }' | gfactor  | sanitize_gnu_factor

58870952193946852435332666506835273111444209706677713:
    7^7
    11^11
    13^13
    17^17
    
116471448967943114621777995869564336419122830800496825559417754612566153180027:
    7^7
    11^11
    13^13
    17^17
    19^19
    
2431978363071055324951111475877083878108827552605151765803537946846931963403343871776360412541253748541645309:
    7^7
    11^11
    13^13
    17^17
    19^19
    23^23
    
6244557167645217304114386952069758950402417741892127946837837979333340639740318438767128131418285303492993082345658543853142417309747238004933649896921:
    7^7
    11^11
    13^13
    17^17
    19^19
    23^23
    29^29
    
823543:
    7^7
    
234966429149994773:
    7^7
    11^11
    
71165482274405729335192792293569:
    7^7
    11^11
    13^13

And the core sanitizer does basically the same thing - intra-row duplicate removal:

sanitize_gnu_factor()          # i implemented it as a shell function
{
    mawk -Wi -- '
    BEGIN {
        ______ = "[ ]+"
        ___= _+= _^=__*=____ = FS
       _______ = FS = "[ \v"(OFS = "\f\r\t")"]+"
            FS = ____
    } {
       if (/ is prime$/) {
          print; next
       } else if (___==NF) {
          $NF = " - - - - - - - \140\140\140"\
                "PRIME\140\140\140 - - - - - - - "
       } else {
            split("",_____)
                _ = NF
            do { _____[$_]++ } while(--_<(_*_))
                delete _____[""]
            sub("$"," ")
            _^=_<_
            for (__ in _____) {
                 if (+_<+(___=_____[__])) {
                    sub(" "(__)"( "(__)")+ ",
                    sprintf(" %\47.f^%\47.f ",__,___))
            } }
              ___ = _+=_^=__*=_<_
            FS = _______
         $__ = $__
        FS = ____ } } NF = NF' |

    mawk -Wi -- '
        / is prime$/ { print
       next } /[=]/ { gsub("="," ")
                   } $(_^=(_<_)) = \
        (___=length(__=$_))<(_+=_++)^(_+--_) \
              ?__: sprintf("%.*s......%s } %\47.f dgts ",
        _^=++_,__, substr(__,++___-_),--___)' FS='[:]' OFS=':'
}

Upvotes: 0

sseLtaH

Reputation: 11207

Using GNU sed, if applicable:

$ sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta' input_file
data1 a,b,c,d,e
data2 a,b,c
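
Roughly how this works (my reading of it): \< and \> are GNU word boundaries, so \2 captures one comma-separated item, the s command deletes one later ,\2 repeat of it, and t loops back to the :a label until no substitution succeeds. A commented, multi-line sketch of the same script:

sed -E '
  :a
  # \2 captures an item; the substitution drops one later ",item" repeat of it
  s/((\<[^,]*\>).*),\2/\1/
  # t branches back to :a whenever the substitution made a change,
  # so the loop runs until no duplicates remain on the line
  ta
' input_file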

Upvotes: 0

jhnc

Reputation: 16652

This is similar to the code you found.

awk -F'[ ,]' '
    {
        s = $1 " " $2
        seen[$2]++

        for (i=3; i<=NF; i++)
            if (!seen[$i]++) s = s "," $i

        print s
        delete seen
    }
' data-file
  • -F'[ ,]' - split input lines on spaces and commas
  • s = ... - we could use printf like the code you found, but building a string is less typing
  • !seen[x]++ is a common idiom - it returns true only the first time x is seen
  • to avoid special-casing when to print a comma (as your sample code does with spaces), we simply add $2 to the print string and set seen[$2]
  • then for the remaining columns (3 .. NF), we add comma and column if it hasn't been seen before
  • delete seen - clear the array for the next line (a sample run follows below)
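
For example, running it against the file1 sample from the question (assuming the first column is separated by a space, as shown; if it is really tab-separated, as the cut -d$'\t' in the question suggests, use -F'[\t ,]' instead) should print:

$ awk -F'[ ,]' '{ s = $1 " " $2; seen[$2]++; for (i=3; i<=NF; i++) if (!seen[$i]++) s = s "," $i; print s; delete seen }' file1
data1 a,b,c,d,e
data2 a,b,c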

Upvotes: 1

WeDBA

Reputation: 343

That code is basically right; you just need to specify the delimiter and change $1 to $i.

$ awk -F ',' '{
    for(i=1; i<=NF; i++)
            printf "%s", (!seen[$i]++? (i==1?"":FS) $i: "" )
    delete seen; print ""
}' /tmp/file1
data1 a,b,c,d,e
data2 a,b,c
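
Equivalently, the separator can be set in a BEGIN block instead of with -F. Note that the inline FS="," attempt in the question could not work on the line being processed: the current record has already been split with the old separator by the time the action runs, so an FS assignment only takes effect from the next record onward.

awk 'BEGIN { FS = "," } {
    for (i=1; i<=NF; i++)
        printf "%s", (!seen[$i]++ ? (i==1 ? "" : FS) $i : "")
    delete seen; print ""
}' /tmp/file1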

Upvotes: 1
