Reputation: 699
I have a fragment of text file (this text file is huge):
114303 SOL1443
114311 SOL679
114316 SOL679
114432 SOL1156
114561 SOL122
114574 SOL2000
114952 SOL3018
115597 SOL609
115864 SOL2385
115993 SOL3448
SOL2 61571
SOL3 87990
SOL4 96242
SOL5 6329
SOL5 16550
SOL9 84894
SOL9 84911
SOL12 91985
SOL15 85816
I need to write a script which will delete lines that have a duplicate SOL number. It doesn't matter whether the SOL field is in the first or the second column. For example, in the text I have
115993 SOL269
SOL269 84911
12373 SOL269
So my script should delete the second and third lines:
SOL269 84911
12373 SOL269
I know that in awk I can use
awk '!seen[$0]++' data.txt
to delete duplicate lines, but it only deletes lines that are identical in every column. Please help me!
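For example, on my fragment above that command keeps both SOL679 lines, because the whole lines differ in the first column:
114311 SOL679
114316 SOL679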
Upvotes: 0
Views: 89
Reputation: 85895
You need to extract the value of the SOL field and group the contents of the file based on it. The command below uses the match() function to locate the pattern SOL followed by digits in the current line; substr() with the built-in RSTART and RLENGTH variables then copies the matched text into the variable sol. With that value in hand, the usual !unique[sol]++ logic prints only the first line containing each SOL number.
awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
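For example, fed the three SOL269 lines from the question, it keeps only the first one:

$ printf '115993 SOL269\nSOL269 84911\n12373 SOL269\n' | awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
115993 SOL269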
Not saying perl is any better than the above, but you can do
perl -ne '/(SOL\d+)/; print unless $unique{$1}++' file
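One caveat (my addition, not from the original answer): in the perl one-liner, a line with no SOL field would leave $1 holding the value from a previous line. Guarding the test avoids relying on that; for the data shown every line has a SOL field, so both variants behave the same:

perl -ne 'print unless /(SOL\d+)/ && $unique{$1}++' file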
Upvotes: 1
Reputation: 5016
You can do this with the same idea as your awk command (just do some preprocessing to select the column to use in the seen array):
awk '{if($1 ~ /^SOL/){sol_kw=$1}else{sol_kw=$2}}!seen[sol_kw]++' <file>
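If the SOL field could ever land in a column other than the first two, a field-scanning variant (a generalization I am adding, not part of the original answer) works the same way:

awk '{for(i=1;i<=NF;i++) if($i ~ /^SOL[0-9]+$/){sol_kw=$i; break}} !seen[sol_kw]++' <file>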
Upvotes: 0
Reputation: 3109
As your SOL field is not always in the same place, you first have to find it.
awk '{
    end = substr($0, index($0, "SOL"))         # from "SOL" to the end of the line
    sp = index(end, " ")                       # position of the space after the SOL field, 0 if none
    sol = (sp ? substr(end, 1, sp - 1) : end)  # cut at the space, or keep the rest of the line
}
!seen[sol]++
' data.txt
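As a quick sanity check (a worked trace I am adding): in awk, index(s, t) returns the 1-based position of t inside s, or 0 if it is absent. On the line 114311 SOL679, index($0, "SOL") is 8, so end becomes SOL679; no space follows, so sol is all of end. On SOL5 16550, end is the whole line and the space at position 5 trims sol down to SOL5.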
Upvotes: 0