dkx22

Reputation: 1133

Recursively grep unique pattern in different files

Sorry, the title is not very clear. Say I'm grepping recursively for URLs like this:

grep -ERo '(http|https)://[^/"]+' /folder

and several files in the folder contain the same URL. My goal is to output each URL only once. I tried piping the grep output to uniq or sort -u, but that doesn't help.

Example result:

/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:http://www.w3.org
/www/tmpl/master.tpl.php:http://www.w3.org
/www/tmpl/progress.tpl.php:http://www.w3.org

Upvotes: 1

Views: 1045

Answers (3)

Benjamin W.

Reputation: 52316

If you only want the address and never the file it was found in, grep has an option -h to suppress the file name prefix; the list can then be piped to sort -u to make sure every address appears only once:

$ grep -hERo 'https?://[^/"]+' folder/ | sort -u
http://www.w3.org

If you don't want the https?:// part, you can use Perl regular expressions (-P instead of -E) with \K, which drops everything matched so far from the reported match and so acts like a variable-length look-behind:

$ grep -hPRo 'https?://\K[^/"]+' folder/ | sort -u
www.w3.org

Upvotes: 1

tripleee

Reputation: 189638

Pipe to Awk:

grep -ERo 'https?://[^/"]+' /folder |
awk -F: '!a[substr($0,length($1)+2)]++'

The basic Awk idiom !a[key]++ is true the first time we see key, and forever false after that. Extracting the URL (or a reasonable approximation) into the key requires a bit of additional trickery.

This prints the whole input line if the key is one we have not seen before, i.e. it will print the file name and the URL for the first occurrence of each URL from the grep output.
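To see the idiom in isolation, here is a minimal sketch run on made-up lines shaped like the grep output in the question. The sample data and the array name seen are assumptions for illustration, and the key extraction assumes file names contain no colon.

```shell
# Hypothetical sample lines shaped like the grep output in the question.
sample='/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://www.w3.org
/www/tmpl/main.tpl.php:https://example.com'

# key = everything after the first colon, i.e. the URL.
# !seen[key]++ is true only the first time a given key shows up,
# so only the first line for each URL is printed.
result=$(printf '%s\n' "$sample" |
  awk '{ key = substr($0, index($0, ":") + 1) } !seen[key]++')

printf '%s\n' "$result"
```

This should print the button.tpl.php line and the main.tpl.php line, and skip header.tpl.php because its URL was already seen.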

Doing the whole thing in Awk should not be too hard, either.
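One possible way to do that, as a rough sketch rather than a tested replacement for the grep pipeline: let find walk the tree and have Awk both extract and deduplicate the URLs. The temporary directory and its two files are made up for the demonstration. If find has to split a very large file list over several awk invocations, duplicates could reappear across invocations.

```shell
# Hypothetical test tree; point "$dir" at your own folder instead.
dir=$(mktemp -d)
printf '<a href="http://www.w3.org/a">\n' > "$dir/one.php"
printf '<a href="http://www.w3.org/b"> <a href="https://example.com/x">\n' > "$dir/two.php"

# find hands every regular file to awk. awk pulls each URL prefix out of
# the line with match() and prints it only the first time it is seen.
result=$(find "$dir" -type f -exec awk '
  {
    line = $0
    while (match(line, "https?://[^/\"]+")) {
      url = substr(line, RSTART, RLENGTH)
      if (!seen[url]++) print url
      line = substr(line, RSTART + RLENGTH)   # keep scanning the rest of the line
    }
  }' {} +)

printf '%s\n' "$result"
rm -r "$dir"
```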

Upvotes: 0

Bromind

Reputation: 1138

If the structure of the output is always: /some/path/to/file.php:http://www.someurl.org

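you can use the cut command:

cut -d ':' -f 2- should work. Basically, it cuts each line into fields separated by a delimiter (here ":") and you select the second and all following fields (-f 2-), so the colon inside the URL is kept.

After that, you can filter with sort -u (or sort | uniq); uniq on its own only removes adjacent duplicate lines.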
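Put together, the pipeline might look like the sketch below. It runs on made-up sample lines instead of real grep output, and it sorts before deduplicating because uniq alone only collapses adjacent duplicates. It also assumes file names contain no colon.

```shell
# Hypothetical sample lines shaped like the grep output in the question.
# The two w3.org lines are deliberately not adjacent.
sample='/www/tmpl/button.tpl.php:http://www.w3.org
/www/tmpl/header.tpl.php:http://example.com
/www/tmpl/main.tpl.php:http://www.w3.org'

# Drop the file name (field 1), keep the rest of the line, then sort and
# remove duplicates in one step.
result=$(printf '%s\n' "$sample" | cut -d ':' -f 2- | sort -u)

printf '%s\n' "$result"
```

With real data, replace the printf with the grep command from the question.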

Upvotes: 0
