Reputation: 1241
I need to get the unique URLs from a web log and then sort them. I was thinking of using the grep, uniq, and sort commands and writing the output to another file.
I executed this command:
cat access.log | awk '{print $7}' > url.txt
then, to get only the unique ones and sort them:
cat url.txt | uniq | sort > urls.txt
The problem is that I can still see duplicates in urls.txt, even though the file is sorted, which means my commands did run. Why?
Upvotes: 19
Views: 5911
Reputation: 67831
uniq | sort does not work: uniq only removes contiguous duplicates, so its input must already be sorted.
The correct order is sort | uniq, or better sort -u, since only one process is spawned.
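A quick way to see the difference, using a throwaway file (sample.txt is just a made-up example):

$ printf 'a\nb\na\nb\n' > sample.txt
$ uniq sample.txt | sort    # no duplicates are adjacent yet, so all survive: a, a, b, b
$ sort sample.txt | uniq    # sorting first makes duplicates adjacent: a, b
$ sort -u sample.txt        # same result, single process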
Upvotes: 26
Reputation: 1322
For nginx access logs, this gives the unique URLs being called:
sed -r "s/.*(GET|POST|PUT|DELETE|HEAD) (.*?) HTTP.*/\2/" /var/log/nginx/access.log | sort | uniq
Reference: https://www.guyrutenberg.com/2008/08/10/generating-url-list-from-access-log-access_log/
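As a quick sanity check with GNU sed, on a hypothetical combined-format log line the expression extracts just the request path:

$ echo '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 612' | sed -r "s/.*(GET|POST|PUT|DELETE|HEAD) (.*?) HTTP.*/\2/"
/index.html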
Upvotes: 0
Reputation: 212198
uniq needs its input sorted, but you sorted after uniq. Try:
$ sort -u < url.txt > urls.txt
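Or do everything in one pipeline (a sketch that reuses the awk command from your question):

$ awk '{print $7}' access.log | sort -u > urls.txt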
Upvotes: 5