user2245653

Reputation: 13

modifying duplicate line removing, order retaining, one-line awk command

I am trying to process a data file from an FE code to remove results generated by un-converged calculations. My files are basically two columns of numbers. I have found a useful AWK solution in another question on Stack Overflow (Explain this duplicate line removing, order retaining, one-line awk command):

awk '!x[$1]++' file > outFile

This prints only the first line in a group of lines where the value of column a is repeated.
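
As far as I understand (my own expansion, so it may not be exact), the one-liner is shorthand for something like:

awk '{ if (x[$1]++ == 0) print }' file > outFile

i.e. a line is only printed the first time its column a value is seen.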

However, in my data file the correct value in column two is on the last line for which the value in column a repeats.

For example, for a file with the data:

a   b
a   c
a   d
b   a
c   d
e   f

awk '!x[$1]++' file > outFile produces

a   b
b   a
c   d
e   f

but I need to generate

a   d
b   a
c   d
e   f

Is it possible to do this by modifying the one-line awk command?

EDIT by Ed Morton (sorry, I couldn't put this in a comment due to formatting):

Given the poster's comment that "the values in column a may repeat for each node, but I only wish to remove duplicates when they are adjacent", I THINK his real sample input and expected output would be something like this:

Input:

a   b
a   c
a   d
b   a
c   d
a   x
a   y
e   f

Output:

a   d
b   a
c   d
a   y
e   f

For the OP - if I'm wrong, delete the above.

Edit:

Sorry, I was trying to simplify my question but have obviously failed to do so adequately. I do not wish to post a full file as these are several MB of text. Each file contains data output per node (at least several hundred nodes). Each node's data starts with a header section:

                         S:Min Principal (
                         Avg: 75p) PI: BLA
                         DE_MERGE-1 N: 143
              X                  6        

Following each header section is a two-column list. The first column contains times, the second the calculated values at that time point and node. However, when the calculation does not converge there may be repeated entries for a given time stamp. The last entry for each time is the correct (converged) result. Times may (or may not) repeat between nodes, and one line for each time should be kept within each node.

Below is example output for one node within a file. This example has only a couple of repeated times and could be edited manually. At other nodes the majority of times may appear 10-15 times; the number of repeats varies, as does the expected number of time points.

            0.                 0.         
            2.E-03            -4.43054    
            4.5E-03           -4.43195    
           10.125E-03         -4.43515    
           22.7813E-03        -4.44235    
           51.2578E-03        -4.45856    
          115.33E-03          -4.49509    
          259.493E-03         -4.57752    
          583.859E-03         -4.76425    
            1.31368           -5.19031    
            2.95578           -6.24656    
            6.65051           -8.77117    
           14.9637           -11.385      
           32.4455           -11.385      
           52.4455           -11.385      
           72.4455           -11.385      
           92.4455           -11.385      
          100.               -11.385      
          100.               -11.385      
          102.               -11.385      
          105.75             -11.385      
          114.188            -11.385      
          133.172            -11.385      
          175.887            -11.385      
          271.995            -11.6325     
          458.493            -27.0386     
          600.               -32.1938     
          600.               -32.1938     
          600.2              -32.1939     
          600.575            -32.1943     
          601.419            -32.1938     
          603.317            -32.192      
          607.589            -32.1879     
          617.2              -32.1759     
          638.824            -31.9507     
          687.479            -31.311      
          796.952            -29.3312     
            1.04327E+03      -27.8592     
            1.59748E+03      -25.3054     
            2.84445E+03      -21.0816     
            4.84445E+03      -20.8229     
            6.84445E+03      -20.8229     
            8.84445E+03      -20.8229     
           10.8444E+03       -20.8229     
           12.6E+03          -20.8229     
           12.6E+03          -20.8229     
           12.6002E+03       -20.8229     
           12.6006E+03       -20.8229     
           12.6014E+03       -20.8229     
           12.6033E+03       -20.8229     
           12.6076E+03       -20.8229     
           12.6172E+03       -20.8229     
           12.6388E+03       -20.8229     
           12.6875E+03       -19.8705     
           12.797E+03        -19.8283     
           12.9955E+03       -20.3811     
           13.1955E+03       -20.6489     
           13.3955E+03       -23.6448     
           13.5955E+03       -23.9506     
           13.7955E+03       -27.1146     
           13.9955E+03       -28.8359     
           14.1955E+03       -24.484      
           14.3955E+03       -11.7371     
           14.42E+03         -11.4293  

Upvotes: 1

Views: 1706

Answers (2)

Chris Seymour

Reputation: 85883

This is one of those cases where you could use uniq without using sort first. If the first field is fixed width you could simply do:

uniq -w1 file
a   b
b   a
c   d
a   x
e   f

If it's not fixed width, use the old rev trick:

rev file | uniq -f1 | rev
a   b
b   a
c   d
a   x
e   f

Note: Using Ed Morton's representative input as file.
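
Note also that uniq keeps the first line of each group of adjacent duplicates rather than the last. One variation that should keep the last line instead (an untested sketch, relying on GNU tac to reverse the line order as well) is:

tac file | rev | uniq -f1 | rev | tac

With the representative input this should print the a d, b a, c d, a y, e f lines.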

Upvotes: 1

Ed Morton

Reputation: 204638

awk 'NR>1 && $1!=p{print s} {p=$1;s=$0} END{print s}' file 
a   d
b   a
c   d
a   y
e   f
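
The same logic written out with comments:

awk '
    NR > 1 && $1 != p { print s }        # first field changed: print the saved last line of the previous group
                      { p = $1; s = $0 } # remember the current first field and the current line
    END               { print s }        # print the last line of the final group
' file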

Upvotes: 2
