Reputation: 115
I have a text file whose contents may contain duplicates. Below is a simplified representation of my txt file (text means a unique character, word or phrase). Note that the separator ---------- may not be present. Also, the whole content of the file consists of Unicode Japanese and Chinese characters.
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
What I want to achieve is to keep only the last occurrence of each duplicated line, like so:
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
The closest I found online is How to remove only the first occurrence of a line in a file using sed, but that requires knowing which matching pattern(s) to delete. The suggested topics shown while writing the title were Duplicating characters using sed and last occurence of date, but they didn't work either.
I am on a Mac running Sierra. I write my commands in a script.sh file and execute them line by line. I'm using sed and gsed as my primary stream editors.
Upvotes: 3
Views: 2679
Reputation: 196
I found a simpler solution, but it sorts the file in the process. So if you don't mind the output being in sorted order, you can use the following:
$ sort -u input.txt > output.txt
Note: the -u flag sorts the lines of the file, listing only the unique ones.
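As a quick sanity check on the sample file (a minimal sketch, assuming it is saved as input.txt), the three duplicate aaaa lines collapse to a single one:
$ grep -c '^aaaa$' input.txt
3
$ sort -u input.txt | grep -c '^aaaa$'
1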
Upvotes: 0
Reputation: 103774
This awk is very close.
Given:
$ cat file
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
You can do:
$ awk 'BEGIN{FS=":"}
# first pass: count every field and record the line number of its last occurrence
FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
# second pass: skip empty lines
/^$/ {next}
# print a line only when one of its fields is at its last recorded occurrence
{for (i=1; i<=NF; i++)
    if (dup[$i] && FNR==last[$i]) {print $0; next}}
' file file
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
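If the ---------- separator should also disappear from the output (as in the expected result in the question), one possible tweak, not part of the original answer, is an extra pattern that skips lines consisting only of dashes:
$ awk 'BEGIN{FS=":"}
FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
/^$/ || /^-+$/ {next}    # additionally skip separator lines made up of dashes
{for (i=1; i<=NF; i++)
    if (dup[$i] && FNR==last[$i]) {print $0; next}}
' file file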
Upvotes: 1
Reputation: 42999
I am not sure if your intent is to preserve the original order of the lines. If that is the case, you could do this:
export LC_ALL=en_US.utf8 # to handle unicode characters in file
nl -n rz -ba file | sort -k2,2 -t$'\t' | uniq -f1 | sort -k1,1 | cut -f2
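For reference, here is what the numbering stage alone produces on the sample file (a small sketch, assuming nl's default six-digit zero padding and tab separator):
$ nl -n rz -ba file | head -4
000001	sometext1
000002	sometext2
000003	sometext3
000004	aaaa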
- nl -n rz -ba file adds zero-padded line numbers to the file
- sort -k2,2 -t$'\t' sorts the output of nl by the second field (note that nl puts a tab after the line number)
- uniq -f1 removes the duplicates, while ignoring the line number field (-f1)
- sort -k1,1 restores the original order of the lines, with duplicates removed
- cut -f2 removes the line number field, restoring the content to the original format
Upvotes: 5
Reputation: 58381
This might work for you (GNU sed):
sed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' file
Store the first line in the hold space (HS) and append every subsequent line. Swap to the HS and remove any duplicate line that matches the last line. Also delete any separator lines and then swap back to the pattern space (PS). Delete all but the last line, which is swapped with the HS and printed out.
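Since the question mentions macOS Sierra with gsed installed, this would presumably be invoked as gsed (the -r flag is a GNU extension; stock BSD sed generally expects -E instead):
$ gsed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' file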
Upvotes: 0