Char

Reputation: 115

How to keep the last occurrence of duplicate lines in a text file?

I have a text file whose contents may contain duplicate lines. Below is a simplified representation of my txt file (text means a unique character, word, or phrase). Note that the separator ---------- may not be present. Also, the whole content of the file consists of Unicode Japanese and Chinese characters.

EDITED

sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10

What I want to achieve is to keep only the last occurrence of each duplicated line, like so:

sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10

The closest I found online is How to remove only the first occurrence of a line in a file using sed, but that requires knowing which matching pattern(s) to delete. The suggested topics offered while I was writing the title were Duplicating characters using sed and last occurence of date, but they didn't work either.

I am on a Mac with Sierra. I am writing my commands in a script.sh file and executing them line by line. I'm using sed and gsed as my primary stream editors.

Upvotes: 3

Views: 2679

Answers (5)

user17060100

Reputation: 1

Like in the uniq manual:

cat input.txt | uniq -d
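
For reference, uniq only compares adjacent lines, so on the sample above (saved as input.txt, as in the command) this should print just the two runs that happen to be adjacent:

$ cat input.txt | uniq -d
aaaa
bbbb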

Upvotes: 0

v.j

Reputation: 196

I found a simpler solution, but it sorts the file in the process. So if you don't mind the output being sorted, you can use the following:

$ sort -u input.txt > output.txt

Note: the -u flag sorts the lines of the file and outputs only the unique lines.
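
To illustrate that the result comes out in sorted order rather than the original order, here is a quick check on the sample (LC_ALL=C is added here only to make the byte-wise ordering predictable; it is not part of the answer):

$ LC_ALL=C sort -u input.txt | head -n 4
----------
aaaa
bbbb
cccc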

Upvotes: 0

dawg

Reputation: 103774

This awk is very close.

Given:

$ cat file
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10

You can do:

$ awk 'BEGIN{FS=":"} 
        FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
        /^$/ {next}
        {for (i=1; i<=NF; i++) 
            if (dup[$i] && FNR==last[$i]) {print $0; next}}
        ' file file
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
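
If the separator line should be dropped as well, a small addition to the second pass (assuming the separator is always a line made up only of dashes) should be enough, with everything else unchanged:

$ awk 'BEGIN{FS=":"}
        FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
        /^$/ {next}
        /^-+$/ {next}   # assumed separator format: a line consisting only of dashes
        {for (i=1; i<=NF; i++)
            if (dup[$i] && FNR==last[$i]) {print $0; next}}
        ' file file

With that extra skip, the output should match the desired output shown in the question.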

Upvotes: 1

codeforester

Reputation: 42999

I am not sure if your intent is to preserve the original order of the lines. If that is the case, you could do this:

export LC_ALL=en_US.utf8 # to handle unicode characters in file
nl -n rz -ba file | sort -k2,2 -t$'\t' | uniq -f1 | sort -k1,1 | cut -f2
  • nl -n rz -ba file adds zero-padded line numbers to the file
  • sort -k2,2 -t$'\t' sorts the output of nl by the second field (note that nl puts a tab after the line number)
  • uniq -f1 removes the duplicates while ignoring the line number field (-f1); see the note after this list about which occurrence is kept
  • the final sort restores the original order of the lines, with duplicates removed
  • cut -f2 removes the line number field, restoring the content to the original format
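
One caveat: when two lines have identical content, sort breaks the tie on the whole line, which puts the lower line number first, so uniq -f1 ends up keeping the first occurrence of each duplicate. To keep the last occurrence instead, as the question asks, the line-number field can be used as a secondary, reversed sort key, along these lines (an untested sketch; it deduplicates whole lines only, so it does not treat cccc and sometext7:cccc as duplicates of each other, and it leaves the separator line in place):

nl -n rz -ba file | sort -t$'\t' -k2,2 -k1,1r | uniq -f1 | sort -k1,1 | cut -f2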

Upvotes: 5

potong

Reputation: 58381

This might work for you (GNU sed):

sed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' file

Store the first line in the hold space (HS) and append every subsequent line to it. Swap to the HS and remove any earlier duplicate of the latest line. Also delete any trailing separator line, then swap back to the pattern space (PS). Delete all lines except the last, which is swapped with the HS and printed out.
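
For readers who want to follow the one-liner step by step, here is the same script spread out with comments (GNU sed/gsed only; the commands are identical, only layout and comments are added):

gsed -r '
# first line: copy the pattern space into the hold space
1h
# every later line: append the pattern space to the hold space
1!H
# swap, so the accumulated lines are in the pattern space
x
# if an earlier line matches (the end of) the newest line, remove the earlier copy
s/([^\n]+)\n(.*\1)$/\2/
# remove a trailing separator line of dashes
s/\n-+$//
# swap back to the current input line
x
# delete (do not print) everything until the last line
$!d
# on the last line, swap the accumulated result back in so it is auto-printed
x
' file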

Upvotes: 0
