Reputation: 3584
I have the following code snippet from an HTML file:
<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
I want to extract the codes
520z3AjKzHL
519z3AjKzHL
31F-sI61AyL
71k-DIrs-8L
61CCOS0NGyL
from the HTML.
Please note that <img src="" style="display:none;"/> must be used, because there are other similar URLs in the HTML file but I only want the ones inside <img src="" style="display:none;"/>.
My Code is:
cat HTML | grep -Po '(?<img src="http://example.com/images/I/).*?(?=.jpg" style="display:none;"/>)'
Something seems to be wrong.
Upvotes: 1
Views: 2045
Reputation: 3236
If you consider gawk a valid bash solution:
awk -F'[/|._]' '/<img src="[^"]*" style="display:none;"\/>/{print $7}' file
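To see why the seventh field is the one that holds the code, here is a small sketch that splits a single line from the question on the same separators. The variable names are only for illustration, and the field position is an assumption that holds only for URLs shaped like this one (a host with exactly one dot and the same path depth).

```shell
# One line taken from the question's snippet.
line='<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>'

# Splitting on / | . _ gives:
#   $1=<img src="http:  $2=(empty)  $3=example  $4=com  $5=images  $6=I  $7=the code
field=$(printf '%s\n' "$line" | awk -F'[/|._]' '{print $7}')
echo "$field"
```

If the host name or the path changes, the field number has to change with it, which makes this approach more fragile than a regex anchored on the URL prefix.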
Upvotes: 0
Reputation: 11
I assume you were looking for a lookbehind to start, which is what was throwing the error. The syntax is (?<=foo), not (?<foo).
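A minimal sketch of the difference, assuming GNU grep built with PCRE support (the -P option). The input string "foobar" is made up for illustration.

```shell
# (?<=foo) is a lookbehind: "foo" must precede the match but is not part of it.
out=$(printf 'foobar\n' | grep -Po '(?<=foo)bar')
echo "$out"

# (?<foo is read as the start of a named group, (?<name>...), and the name is
# never terminated, so grep rejects the pattern and exits with a non-zero status.
status=0
printf 'foobar\n' | grep -Po '(?<foo)bar' 2>/dev/null || status=$?
echo "exit status: $status"
```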
This gives the result you specified, but I do not know whether you need everything up to the JPG extension or not:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/)[^.]*'
Up to and excluding the .jpg extension would be:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/).*(?=\.jpg)'
Upvotes: 0
Reputation: 420951
You can solve it by using a positive lookahead and lookbehind:
cat HTML | grep -Po "(?<=<img src=\"http://example.com/images/I/).*?(?=\._.*\.jpg\" style=\"display:none;\"/>)"
Demonstration:
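A minimal reproduction, assuming GNU grep with PCRE support (-P). It writes the question's snippet to a scratch file, plus one made-up <img> line without the style attribute (the code 41EXAMPLE00L is hypothetical) to check that such lines are skipped. The pattern is a single-quoted variant of the command above with the dot before jpg escaped.

```shell
# Recreate the input in a temporary file.
html=$(mktemp)
cat > "$html" <<'EOF'
<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
<img src="http://example.com/images/I/41EXAMPLE00L._AA30_.jpg"/>
EOF

# -P enables Perl-compatible regexes, -o prints only the matched part.
codes=$(grep -Po '(?<=<img src="http://example.com/images/I/).*?(?=\._.*\.jpg" style="display:none;"/>)' "$html")
printf '%s\n' "$codes"

rm -f "$html"
```

This should print the five codes from the question, one per line, and nothing for the last line, since the lookahead requires the style attribute right after the file name.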
Regexp breakdown:
.*? : match all characters reluctantly
(?<=<img src=...ges/I/) : preceded by <img .../I/
(?=\._...ne;\"/>) : followed by ._...ne;\"/>
Upvotes: 2