Reputation: 13
I am trying to process a data file from an FE code to remove answers generated from unconverged calculations. My files are basically two columns of numbers. I found a useful AWK solution from another questioner on Stack Overflow (Explain this duplicate line removing, order retaining, one-line awk command):
awk '!x[$1]++' file > outFile
This prints only the first line in each group of lines where the value of the first column is repeated (x[$1] is zero, so !x[$1]++ is true, only the first time a key is seen). However, in my data file the correct value in column two is on the last line of each such group. For example,
for a file with data:
a b
a c
a d
b a
c d
e f
awk '!x[$1]++' file > outFile
produces
a b
b a
c d
e f
but I need to generate
a d
b a
c d
e f
Is it possible to do this by modifying the one-line awk?
EDIT by Ed Morton (sorry, I couldn't put this in a comment due to formatting):
Given the poster's comment that "the values in column a may repeat for each node, but I only wish to remove duplicates when they are adjacent", I THINK his real sample input and expected output would be something like this:
Input:
a b
a c
a d
b a
c d
a x
a y
e f
Output:
a d
b a
c d
a y
e f
For the OP - if I'm wrong, delete the above.
Edit:
Sorry, I was trying to simplify my question but have obviously failed to do so adequately. I do not wish to post a full file as these are several MB of text. Each file contains data output node by node (at least several hundred nodes). Each node's data starts with a header section:
S:Min Principal (
Avg: 75p) PI: BLA
DE_MERGE-1 N: 143
X 6
Following each header section is a two-column list. The first column contains times, the second the calculated values at that time point and node. However, when the calculation does not converge there may be repeated entries for a given time stamp. The last entry for each time is the correct (converged) result. Times may (but need not) repeat between nodes, and one line for each time should be kept within each node.
Below is example output for one node within a file. This node has only a couple of repeated times and could be edited manually. At other nodes the majority of times may appear 10-15 times; the number of repeats varies, as does the expected number of time points.
0. 0.
2.E-03 -4.43054
4.5E-03 -4.43195
10.125E-03 -4.43515
22.7813E-03 -4.44235
51.2578E-03 -4.45856
115.33E-03 -4.49509
259.493E-03 -4.57752
583.859E-03 -4.76425
1.31368 -5.19031
2.95578 -6.24656
6.65051 -8.77117
14.9637 -11.385
32.4455 -11.385
52.4455 -11.385
72.4455 -11.385
92.4455 -11.385
100. -11.385
100. -11.385
102. -11.385
105.75 -11.385
114.188 -11.385
133.172 -11.385
175.887 -11.385
271.995 -11.6325
458.493 -27.0386
600. -32.1938
600. -32.1938
600.2 -32.1939
600.575 -32.1943
601.419 -32.1938
603.317 -32.192
607.589 -32.1879
617.2 -32.1759
638.824 -31.9507
687.479 -31.311
796.952 -29.3312
1.04327E+03 -27.8592
1.59748E+03 -25.3054
2.84445E+03 -21.0816
4.84445E+03 -20.8229
6.84445E+03 -20.8229
8.84445E+03 -20.8229
10.8444E+03 -20.8229
12.6E+03 -20.8229
12.6E+03 -20.8229
12.6002E+03 -20.8229
12.6006E+03 -20.8229
12.6014E+03 -20.8229
12.6033E+03 -20.8229
12.6076E+03 -20.8229
12.6172E+03 -20.8229
12.6388E+03 -20.8229
12.6875E+03 -19.8705
12.797E+03 -19.8283
12.9955E+03 -20.3811
13.1955E+03 -20.6489
13.3955E+03 -23.6448
13.5955E+03 -23.9506
13.7955E+03 -27.1146
13.9955E+03 -28.8359
14.1955E+03 -24.484
14.3955E+03 -11.7371
14.42E+03 -11.4293
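For reference, the per-node handling I imagine would be a sketch along these lines, assuming data lines always begin with a digit and header lines never do (true of the sample above, though I have not checked every file):
awk '
    /^[0-9]/ {                                       # data line: time then value
        if (pend != "" && $1 != prev) print pend     # time changed: keep last line of previous group
        prev = $1; pend = $0
        next
    }
    {                                                # header line: flush pending data, reset, pass through
        if (pend != "") { print pend; pend = ""; prev = "" }
        print
    }
    END { if (pend != "") print pend }               # flush the final group
' file > outFile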
Upvotes: 1
Views: 1706
Reputation: 85883
This is one of those cases where you could use uniq without using sort first. If the first field is fixed width you could simply do (note that uniq -w is not in POSIX):
uniq -w1 file
a b
b a
c d
a x
e f
If it's not fixed width, use the old rev trick (reversing each line puts the first field last, so uniq -f1 compares only that reversed field, and the second rev restores the lines):
rev file | uniq -f1 | rev
a b
b a
c d
a x
e f
Note: Using Ed Morton's representative input as file.
Upvotes: 1
Reputation: 204638
awk 'NR>1 && $1!=p{print s} {p=$1;s=$0} END{print s}' file
a d
b a
c d
a y
e f
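For readability, the same one-liner spelled out with comments:
awk '
    NR > 1 && $1 != p { print s }   # first field changed: print the saved last line of the previous group
    { p = $1; s = $0 }              # remember the current key and line
    END { print s }                 # print the last line of the final group
' file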
Upvotes: 2