doko

Reputation: 15

Extracting hyperlinks from a page using wget and grep

I'm trying to extract all hyperlinks from a single page using wget and grep, and I found this code that uses PCRE to get all the hyperlinks.

But I'm not really familiar with regex or HTML, so I want to know whether this is a sound way of going about it or if there is a better way. I also have a question about it: do you really need to escape the quotes? I tested it a few times and it doesn't seem to make a difference.

wget https://google.com -q -O - | grep -Po '(?<=href=\")[^\"]*'

Any help will be appreciated!

Upvotes: 1

Views: 861

Answers (1)

vintnes

Reputation: 2030

Your command will grab the contents of all href="..." attributes that appear entirely on one line.
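
Grep reads line by line, so an href= attribute that is split across two lines is never matched. A quick sketch of that limitation (the URL is made up):

# The lookbehind needs href=" intact on one line; here it is split, so nothing prints.
printf 'href=\n"https://example.com"\n' | grep -Po '(?<=href=")[^"]*'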

You don't need to individually escape your double quotes \" if the whole string is surrounded by 'single quotes'. The point of quoting is to prevent characters from being interpreted by the shell. The only time you need to escape double quotes is when you're allowing for shell expansions, e.g.:

foo=href
# Double quotes let ${foo} expand, so the inner quotes must be escaped.
grep -Po "(?<=${foo}=\")[^\"]*"

This is identical to

grep -Po '(?<=href=")[^"]*'
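
You can check the equivalence yourself; a quick sketch (the sample line is made up):

sample='<a href="https://example.com/page">link</a>'
# Both print https://example.com/page: inside single quotes the backslashes
# reach grep unchanged, and PCRE treats \" as a literal quote anyway.
echo "$sample" | grep -Po '(?<=href=\")[^\"]*'
echo "$sample" | grep -Po '(?<=href=")[^"]*'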

Which means

  • Grep, using PCRE (-P)
  • return only the matched text (-o)
  • look for any string preceded by the literal string href=" (the lookbehind (?<=...))
  • match anything that's not a double quote ([^"])
  • zero or more times (*)
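
Put together, grep prints each match on its own line, even when several hrefs share one line of HTML. A small sketch with made-up filenames:

# Prints a.html and b.html on separate lines.
echo '<a href="a.html">one</a> <a href="b.html">two</a>' | grep -Po '(?<=href=")[^"]*'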

The use of * may return an empty string if you ever parse <a href="">. You could use + (one or more times) instead of * (zero or more times).
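
A quick sketch of the + variant, assuming you want to skip empty href values entirely:

# The empty href="" yields no match with +; only the real URL prints.
printf '<a href="">\n<a href="https://example.com">\n' | grep -Po '(?<=href=")[^"]+'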

Upvotes: 2
