CODEWITHSUNDEEP

linux bash shell sed awk

Reputation: 35

How to remove lines with duplicate pair of words?

I have a file with multiple columns like

abc cvn bla..bla..n_columns
xnt yuk m_columns
abc cvn xxxx
vbh ast
sth rty
xnt yuk

I want to create a new file that keeps only the first line for each word pair in the first two columns and drops later lines that repeat the pair. The final file should look like

abc cvn bla..bla..n_columns
xnt yuk m_columns
vbh ast
sth rty

Upvotes: 0

Views: 1096

Answers (3)

NeronLeVelu

Reputation: 10039

sed -n 'H
$ {x
   s/$/\
/
: again
   s/\(\n\)\([^ ]\{1,\} \{1,\}[^ [:cntrl:]]\{1,\}\)\(.*\)\1\2[^[:cntrl:]]*\n/\1\2\3\1/
   t again
   s/\n\(.*\)\n/\1/
   p
   }' YourFile

This works on any repeated pair of values in the whole text (a pair is two runs of characters that are neither spaces nor newlines, separated by spaces). It loops for as long as a duplicate pair is found and removed.

Principle

H appends each line from the working buffer to the hold buffer (sed reads input line by line into a working buffer and also keeps a separate hold buffer)
$ applies the following block only at the last line
x swaps the working and hold buffers, so the whole file is now in the working buffer, starting with a newline (because of the append action)
s/$/... adds a newline at the end (used as a delimiter by the later substitution)
: again sets a label (for a later goto)
s/\(\n\).../ is the core of the process. It looks for a pair of words at the start of a line and the same pair at the start of a later line. If found, it replaces the whole block with the part from the start of the block up to, but not including, the second pair. (The block runs from the first pair to the newline that ends the line holding the second pair.)
t again jumps back to the label again if the previous substitution changed anything
s/\n\(.*\)\n/\1/ removes the newlines added at the start and end
p prints the result

sed always matches as much of a pattern as it can, so if a pair occurs more than twice, it removes the last occurrence first and works backwards until only one remains.
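The H / $ / x steps can be tried on their own. The following is only a minimal sketch, assuming a hypothetical three-line input (a, b, c) and a sed that accepts \n in the pattern of s///. It collects every line in the hold buffer, swaps it back at the last line, and joins the lines with | so that the single-buffer result is visible.

```shell
# Hypothetical input: three short lines, not taken from the question.
# H  : append each line to the hold buffer (each one preceded by a newline)
# $  : run the block only on the last line
# x  : swap buffers, so the working buffer now holds "\na\nb\nc"
# s/^\n//   : drop the leading newline left by the first append
# s/\n/|/g  : make the remaining newlines visible as |
joined=$(printf 'a\nb\nc\n' | sed -n 'H
${
  x
  s/^\n//
  s/\n/|/g
  p
}')

printf '%s\n' "$joined"   # prints a|b|c
```

Once the whole file sits in one buffer like this, a single s/// can match across what used to be separate lines, which is what the deduplication loop relies on.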

Upvotes: 0

Reputation: 203169

All you need is:

awk '!seen[$1,$2]++' file
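To see why this works, here is a long-hand sketch of the same logic, run on the sample data from the question. The temporary file and the variable names (input, deduped, key) are assumptions made for illustration. seen is an associative array keyed on the first two fields; a line is printed only the first time its key appears.

```shell
# Sample data copied from the question; the temp file is only for the demo.
input=$(mktemp)
cat > "$input" <<'EOF'
abc cvn bla..bla..n_columns
xnt yuk m_columns
abc cvn xxxx
vbh ast
sth rty
xnt yuk
EOF

# Long-hand equivalent of: awk '!seen[$1,$2]++' file
deduped=$(awk '
{
    key = $1 SUBSEP $2      # the same key that seen[$1,$2] builds internally
    if (!(key in seen))     # first time this pair shows up
        print               # keep the line
    seen[key]++             # record it, so later duplicates are skipped
}' "$input")

printf '%s\n' "$deduped"
rm -f "$input"
```

In the one-liner, seen[$1,$2]++ evaluates to 0 the first time a pair is met, so !0 is true and awk applies its default action, printing the line. On every later occurrence the counter is non-zero and the line is skipped. Input order is preserved.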

Upvotes: 5

ray

Reputation: 4267

If abc cvn xxxx appears before abc cvn bla..bla..n_columns, I just want to keep either line. It does not matter to me which one is kept; either will be okay.

If the output order doesn't matter, you can use sort

sort -u -k1,2 file

Otherwise, you should use awk as suggested by devnull.
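A small sketch of the sort approach on the question's sample data shows the trade-off. The temporary file and variable names are assumptions for illustration, and LC_ALL=C is set only to make the ordering predictable. The result has one line per pair, but in key order rather than input order, and which of the duplicate lines survives should not be relied on.

```shell
# Sample data copied from the question; the temp file is only for the demo.
input=$(mktemp)
cat > "$input" <<'EOF'
abc cvn bla..bla..n_columns
xnt yuk m_columns
abc cvn xxxx
vbh ast
sth rty
xnt yuk
EOF

# -k1,2 : compare on fields 1 through 2 only
# -u    : output one line per distinct key
sorted=$(LC_ALL=C sort -u -k1,2 "$input")

printf '%s\n' "$sorted"

# Show just the surviving pairs: abc cvn, sth rty, vbh ast, xnt yuk (sorted).
pairs=$(printf '%s\n' "$sorted" | awk '{ print $1, $2 }')
printf '%s\n' "$pairs"
rm -f "$input"
```

Note that sth rty now comes before vbh ast, unlike in the input, which is why awk is the better fit when the original order has to be kept.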

Upvotes: 0
