Reputation: 135

Remove duplicate lines from file

I have a list of URLs, most of which are duplicates:

> http://example.com/some/a-test-link.html
> http://example.com/some/a-test-link.html
> http://example.com/some/another-link.html
> http://example.com/some/another-link.html
> http://example.com/some/again-link.html
> http://example.com/some/again-link.html

I don't need the same link twice, so I want to remove the duplicates and keep one copy of each link. How can I do this with regular expressions, sed, or awk? (I am not sure which tool would be best.) I am using Ubuntu as the operating system and Sublime Text 3 as my editor.

Upvotes: 3

Answers (5)

Ed Morton

Reputation: 203597

$ sort -u file
> http://example.com/some/again-link.html
> http://example.com/some/another-link.html
> http://example.com/some/a-test-link.html

Upvotes: 3

Cole Tierney

Reputation: 10314

You could also use a combination of sort and uniq:

sort input.txt | uniq

Sorting places identical links next to each other, and uniq then collapses each run of consecutive repeated lines into a single line.
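To illustrate why the sort step matters, here is a minimal sketch. It assumes a throwaway sample file created with mktemp (the file and its contents are made up for the demonstration, not taken from the question), in which the duplicates are not adjacent:

```shell
# Hypothetical sample input: the repeated lines are NOT next to each other.
input=$(mktemp)
printf '%s\n' b a b a > "$input"

# uniq alone only compares neighbouring lines, so nothing is removed here.
unsorted_result=$(uniq "$input")

# Sorting first makes identical lines adjacent, so uniq can collapse them.
sorted_result=$(sort "$input" | uniq)

printf 'uniq only:\n%s\n' "$unsorted_result"
printf 'sort | uniq:\n%s\n' "$sorted_result"

rm -f "$input"
```

Note that the sorted output no longer follows the original line order.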

Upvotes: 2

potong

Reputation: 58430

This might work for you (GNU sed):

sed -r 'G;/(http[^\n]*)\n.*\1/d;s/\n.*//;H' file

Use the hold space to collect previously seen URLs, and delete any line whose URL already appears there.

Upvotes: 2

jaypal singh

Reputation: 77105

Very trivial using awk:

awk '!seen[$0]++' file

which basically means:

awk '!($0 in seen) {seen[$0]; print}' file

So if the line is not yet in the array, it is added to the array and printed. Any later line that is already in the array is skipped.

$ cat file
> http://example.com/some/a-test-link.html
> http://example.com/some/a-test-link.html
> http://example.com/some/another-link.html
> http://example.com/some/another-link.html
> http://example.com/some/again-link.html
> http://example.com/some/again-link.html
$ awk '!seen[$0]++' file
> http://example.com/some/a-test-link.html
> http://example.com/some/another-link.html
> http://example.com/some/again-link.html
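If the deduplicated result should replace the original file, one possible approach is to write to a temporary file and move it into place. This is only a sketch: the sample file below is created with mktemp purely for illustration, and the variable names are assumptions, not part of the answer above.

```shell
# Hypothetical sample file with non-adjacent duplicates.
file=$(mktemp)
printf '%s\n' \
  'http://example.com/some/a-test-link.html' \
  'http://example.com/some/another-link.html' \
  'http://example.com/some/a-test-link.html' > "$file"

# Keep the first occurrence of each line in its original order.
# The original is replaced only if awk finished successfully.
tmp=$(mktemp)
awk '!seen[$0]++' "$file" > "$tmp" && mv "$tmp" "$file"

deduped=$(cat "$file")
printf '%s\n' "$deduped"
```

Redirecting straight back to the input (awk ... file > file) would truncate the file before awk reads it, which is why the temporary file is used here.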

Upvotes: 4

Pedro Lobito

Reputation: 98921

Not sure if this works for you, but if the links are in the order you've posted (each duplicate directly after its original), the following regex matches each duplicated pair, with the unique URL in capture group 1.

/(http:\/\/.*?)\s+(?:\1)/gm

http://regex101.com/r/zB0pW3

Upvotes: 1
