Reputation: 699
I have a fragment of text file (this text file is huge):
114303 SOL1443
114311 SOL679
114316 SOL679
114432 SOL1156
114561 SOL122
114574 SOL2000
114952 SOL3018
115597 SOL609
115864 SOL2385
115993 SOL3448
SOL2 61571
SOL3 87990
SOL4 96242
SOL5 6329
SOL5 16550
SOL9 84894
SOL9 84911
SOL12 91985
SOL15 85816
I need to write a script which will delete lines that have a duplicate SOL number. It doesn't matter whether the SOL field is in the first or the second column. For example, in the text I have
115993 SOL269
SOL269 84911
12373 SOL269
So my script should delete the second and third lines:
SOL269 84911
12373 SOL269
I know that in awk I can use
awk '!seen[$0]++' data.txt
to delete duplicate lines, but it only deletes lines that are identical in every column. Please help me!
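For example, on my fragment above that command keeps both SOL679 lines, because the whole lines differ in the first column:
114311 SOL679
114316 SOL679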
Upvotes: 0
Views: 89
Reputation: 85895
You need to extract the value of the SOL field and group the contents of the file based on it. The command below uses the match() function to locate the pattern SOL followed by digits in the current line; substr() with the built-in RSTART and RLENGTH variables then copies the matched text into the variable sol. With that value in hand, the usual !unique[sol]++ logic prints only the first line containing each SOL number.
awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
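For example, fed the three SOL269 lines from the question, it keeps only the first one:

$ printf '115993 SOL269\nSOL269 84911\n12373 SOL269\n' | awk 'match($0, /SOL[[:digit:]]+/){ sol = substr($0, RSTART, RLENGTH); } !unique[sol]++'
115993 SOL269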
Not saying perl is any better than the above, but you can do
perl -ne '/(SOL\d+)/; print unless $unique{$1}++' file
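One caveat (my addition, not from the original answer): in the perl one-liner, a line with no SOL field would leave $1 holding the value from a previous line. Guarding the test avoids relying on that; for the data shown every line has a SOL field, so both variants behave the same:

perl -ne 'print unless /(SOL\d+)/ && $unique{$1}++' file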
Upvotes: 1
Reputation: 5016
You can do this with the same idea as your awk command (just do some preprocessing to select the column to use in the seen array):
awk '{if($1 ~ /^SOL/){sol_kw=$1}else{sol_kw=$2}}!seen[sol_kw]++' <file>
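If the SOL field could ever land in a column other than the first two, a field-scanning variant (a generalization I am adding, not part of the original answer) works the same way:

awk '{for(i=1;i<=NF;i++) if($i ~ /^SOL[0-9]+$/){sol_kw=$i; break}} !seen[sol_kw]++' <file>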
Upvotes: 0
Reputation: 3109
As your SOL field is not always in the same place, you first have to find it.
awk '{
    end = substr($0, index($0, "SOL"))         # from "SOL" to the end of the line
    sp = index(end, " ")                       # position of the space after the SOL field, 0 if none
    sol = (sp ? substr(end, 1, sp - 1) : end)  # cut at the space, or keep the rest of the line
}
!seen[sol]++
' data.txt
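As a quick sanity check (a worked trace I am adding): in awk, index(s, t) returns the 1-based position of t inside s, or 0 if it is absent. On the line 114311 SOL679, index($0, "SOL") is 8, so end becomes SOL679; no space follows, so sol is all of end. On SOL5 16550, end is the whole line and the space at position 5 trims sol down to SOL5.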
Upvotes: 0