Reputation: 15
I'm trying to extract all hyperlinks from a single page using wget and grep, and I found this command that uses PCRE to get all the hyperlinks.
But I'm not really familiar with regex or HTML, so I'd like to know whether this is a sound approach or if there is a better way. I also have a question about it: do you really need to escape the quotes? I tested it a few times and it doesn't seem to make a difference.
wget https://google.com -q -O - | grep -Po '(?<=href=\")[^\"]*'
Any help will be appreciated!
Upvotes: 1
Views: 861
Reputation: 2030
Your command will grab the contents of all href strings (href="...") that exist entirely on one line.
You don't need to individually escape your doublequotes (\") if the whole string is surrounded by 'single quotes'. The point of quoting is to prevent characters from being interpreted by the shell. The only time you need to escape doublequotes is when you're allowing for shell expansions, e.g.:
foo=href
grep -Po "(?<=${foo}=\")[^\"]*"
This is exactly identical to
grep -Po '(?<=href=")[^"]*'
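A quick sanity check on a hypothetical HTML snippet (example.com is just a placeholder) shows the two quoting styles match identically:

```shell
# ${foo} expands only inside double quotes, so the inner " must be escaped there
foo=href
line='<a href="https://example.com/a">link</a>'
echo "$line" | grep -Po "(?<=${foo}=\")[^\"]*"   # double quotes: escape the inner "
echo "$line" | grep -Po '(?<=href=")[^"]*'       # single quotes: no escaping needed
```

Both commands print https://example.com/a.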
Which means:

-P          use Perl-compatible regular expressions (PCRE)
-o          print only the matched part of the line
(?<=...)    a lookbehind: the match must be preceded by the literal string href="
[^"]        any character that is not a doublequote
*           repeated zero or more times

The use of * may return an empty string if you ever parse <a href="">. You could use + (one or more times) instead of * (zero or more times).
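For example, on a hypothetical page that contains an empty href, the + version simply skips it:

```shell
# The empty href="" has nothing for [^"]+ to match, so only the real URL is printed
printf '<a href="">empty</a>\n<a href="https://example.com">link</a>\n' \
  | grep -Po '(?<=href=")[^"]+'
# prints: https://example.com
```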
Upvotes: 2