Reputation: 1133
Sorry, the title is not very clear. Let's say I'm grepping recursively for URLs like this:
grep -ERo '(http|https)://[^/"]+' /folder
Several files in the folder contain the same URL, and my goal is to output each URL only once. I tried piping the grep output to | uniq or | sort -u, but that doesn't help.
example result:
/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org
Upvotes: 1
Views: 1045
Reputation: 52316
If you only want the address and never the file it was found in, grep has the -h option to suppress file name output; the list can then be piped to sort -u to make sure every address appears only once:
$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org
If you don't want the https?:// part, you can use Perl-compatible regular expressions (-P instead of -E) with a variable-length look-behind (\K):
$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org
Upvotes: 1
Reputation: 189638
Pipe to Awk:
grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0, length($1) + 2)]++'
The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL into the key takes a bit of additional trickery: $1 is the file name (everything before the first colon), so substr($0, length($1) + 2) skips past the name and the separating colon, leaving just the URL.
This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep
output.
Doing the whole thing in Awk should not be too hard, either.
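For example, here is one sketch of an all-Awk variant (no grep): Awk's match() pulls each URL out of the line, and the same !a[url]++ idiom keeps only first occurrences. The /folder path is the placeholder from the question.

```shell
# Scan all files under /folder; print "file:url" once per distinct URL.
# match() finds the leftmost URL in the remaining part of the line;
# the loop handles lines containing several URLs.
find /folder -type f -exec awk '{
    line = $0
    while (match(line, /https?:\/\/[^\/"]+/)) {
        url = substr(line, RSTART, RLENGTH)
        if (!a[url]++) print FILENAME ":" url
        line = substr(line, RSTART + RLENGTH)
    }
}' {} +
```

Because find passes all the files to a single awk process (-exec … +), the seen-URL array is shared across files, so duplicates in different files are filtered too.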
Upvotes: 0
Reputation: 1138
If the structure of the output is always:
/some/path/to/file.php:http://www.someurl.org
you can use the command cut
:
cut -d ':' -f 2-
should work. It splits each line into fields on a delimiter (here ":") and selects the second and following fields (-f 2-). Since the URL itself contains a colon, you need -f 2- rather than -f 2 so the URL is kept intact (cut rejoins the selected fields with the delimiter).
After that, you can use sort and uniq (or sort -u) to filter out duplicates; note that uniq alone only collapses adjacent identical lines, so the input must be sorted first.
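Put together, a minimal sketch of the full pipeline (using the question's /folder path):

```shell
# sort before uniq, since uniq only removes adjacent duplicates
grep -ERo 'https?://[^/"]+' /folder | cut -d ':' -f 2- | sort | uniq
```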
Upvotes: 0