Amar

Reputation: 6942

Regex to find external links in an HTML file using grep

For the past few days I have been trying to write a regex that extracts all the external links from the web pages given to it, using grep.

Here is my grep command:

grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r 

However, grep seems to return everything after the first external link on the matched line.

Example

If an HTML file contains something like this on a single line:

<a href="http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

then the grep command above returns the following result:

http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>

The idea is that if an HTML file contains more than one link on the same line (regardless of whether it is in an a tag, an img tag, etc.), the regex should fetch only the links and not the rest of that line.

I managed to build a working regex on rubular.com; it is as follows:

("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')

It works with the above input, but I am not able to replicate it in grep. Can anyone help? I can't modify the HTML files, so please don't suggest that. I also can't look for each specific tag and check its attributes to get the external links, as that adds processing time and my application doesn't require it.

Thank You

Upvotes: 4

Views: 5903

Answers (2)

hudolejev

Reputation: 6018

Try this:

cat /path/to/file | egrep -o "(mailto:|(ftp|https?)://)[^'\"]+"

or, without the extra cat:

egrep -o "(mailto:|(ftp|https?)://)[^'\"]+" /path/to/file

This outputs one link per line. It assumes every link is enclosed in single or double quotes. To exclude links from a certain domain, filter the result with -v:

egrep -o "(mailto:|(ftp|https?)://)[^'\"]+" /path/to/file | egrep -v "yahoo\.com"
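
As a quick check, the approach can be run against the sample line from the question. This is only a minimal sketch: the temporary file stands in for a real HTML file, and the pattern is written with explicit grouping, (mailto:|(ftp|https?)://), on the assumption that the scheme separator is meant to apply to ftp and http(s) alike while mailto takes a bare colon.

```shell
# Hypothetical sample file holding the single line from the question.
tmp=$(mktemp)
printf '%s\n' "<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>" > "$tmp"

# -E: extended regex syntax; -o: print only the matched text, one match per line.
# [^'\"]+ stops at the first quote character, so each URL ends at its closing quote.
links=$(grep -E -o "(mailto:|(ftp|https?)://)[^'\"]+" "$tmp")
printf '%s\n' "$links"

rm -f "$tmp"
```

With that input the two URLs should come out on separate lines, without the surrounding markup.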

Upvotes: 5

wds

Reputation: 32283

By default grep prints the entire line a match was found on. The -o switch selects only the matched parts of a line. See the man page.
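
The difference can be seen with a small sketch on the sample line from the question (the variable names are only illustrative). Note that -o prints exactly what the pattern matched, so a greedy .* still runs to the end of the line; the bounded character class in the last command is one assumed way to stop at the closing quote.

```shell
line="<a href=\"http://www.google.com\">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>"

# Without -o: the whole matching line is printed.
whole=$(printf '%s\n' "$line" | grep -E "https?://")

# With -o and a greedy .*: only the match is printed, but it extends to the end of the line.
greedy=$(printf '%s\n' "$line" | grep -E -o "https?://.*")

# With -o and a bounded class: each match stops just before the next quote.
bounded=$(printf '%s\n' "$line" | grep -E -o "https?://[^'\"]*")

printf '%s\n' "$whole" "$greedy" "$bounded"
```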

Upvotes: 1
