DocWiki

Reputation: 3584

Shell: Extract some code from HTML

I have the following snippet from an HTML file:

<div id="rwImages_hidden" style="display:none;">
    <img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
    <img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
    <img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
    <img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
    <img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>

I want to extract the codes

520z3AjKzHL
519z3AjKzHL
31F-sI61AyL
71k-DIrs-8L
61CCOS0NGyL

from the HTML.

Please note that the match must be anchored on <img src="" style="display:none;"/>, because the HTML file contains other similar URLs and I only want the ones inside <img src="" style="display:none;"/>.

My code is:

cat HTML | grep -Po '(?<img src="http://example.com/images/I/).*?(?=.jpg" style="display:none;"/>)'

Something seems to be wrong.

Upvotes: 1

Views: 2045

Answers (3)

ripat

Reputation: 3236

If you consider awk a valid shell solution:

awk -F'[/._]' '/<img src="[^"]*" style="display:none;"\/>/ {print $7}' file
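To see why the seventh field is the one to print, it can help to look at how awk splits a single line. The sketch below is only an illustration: `show_fields` is a hypothetical helper, and the separator set (slash, dot, underscore) is an assumption based on the sample URL in the question.

```shell
# Hypothetical helper: print the first 7 fields awk sees when a line is
# split on "/", "." or "_".
show_fields() {
  awk -F'[/._]' '{ for (i = 1; i <= 7; i++) printf "%d=[%s]\n", i, $i }'
}

# One line taken from the question's HTML sample.
line='    <img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>'
printf '%s\n' "$line" | show_fields
```

Note that the field number depends on the shape of the URL: a host name with more dots, or a deeper path, would shift the code to a different field.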

Upvotes: 0

Trojal

Reputation: 11

I assume you meant to start the pattern with a lookbehind; the malformed syntax is what was throwing the error.

The syntax is (?<=foo), not (?<foo).

This gives the result you specified, although I do not know whether you need everything up to the .jpg extension:

cat HTML | grep -Po '(?<=img src="http://example.com/images/I/)[^.]*'

Everything up to, but excluding, the .jpg extension would be:

cat HTML | grep -Po '(?<=img src="http://example.com/images/I/).*(?=\.jpg)'
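The difference between the two group forms can be checked on a toy input. This is a minimal sketch, assuming GNU grep built with PCRE support (`-P`); the strings `foo` and `bar` and the variable names are placeholders, not part of the original question.

```shell
# Valid lookbehind: matches "bar" only where it is preceded by "foo".
good=$(printf 'foobar\n' | grep -Po '(?<=foo)bar')
echo "$good"

# Malformed group "(?<foo)": PCRE cannot compile the pattern, so grep
# reports an error and exits with a non-zero status (GNU grep documents
# status 2 for errors) instead of printing a match.
bad_status=0
printf 'foobar\n' | grep -Po '(?<foo)bar' >/dev/null 2>&1 || bad_status=$?
echo "exit status of malformed pattern: $bad_status"
```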

Upvotes: 0

aioobe

Reputation: 420951

You can solve it by using a positive lookahead and a positive lookbehind:

cat HTML | grep -Po "(?<=<img src=\"http://example.com/images/I/).*?(?=\._.*\.jpg\" style=\"display:none;\"/>)"


Regexp breakdown:

  • .*? matches any characters reluctantly (as few as possible)
  • (?<=<img src=...ges/I/) preceded by <img .../I/
  • (?=\._...ne;\"/>) followed by ._...ne;\"/>
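The breakdown above can be tried on a small inline sample. This is a sketch only, assuming GNU grep with PCRE support (`-P`); `sample_html` and `extract_ids` are hypothetical helper names, and the third `<img>` line (without the style attribute) is made up to show that it is skipped.

```shell
# Hypothetical sample input: two hidden images and one without the style attribute.
sample_html() {
  cat <<'EOF'
<div id="rwImages_hidden" style="display:none;">
    <img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
    <img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
    <img src="http://example.com/images/I/999zzzzzzzL._AA30_.jpg"/>
</div>
EOF
}

# Hypothetical helper: reads HTML on stdin and prints one image code per line.
# The lookbehind anchors on the URL prefix, the lazy .*? captures the code,
# and the lookahead requires "._....jpg" followed by the style attribute.
extract_ids() {
  grep -Po '(?<=<img src="http://example\.com/images/I/).*?(?=\._.*\.jpg" style="display:none;"/>)'
}

sample_html | extract_ids
```

Single quotes are used here so that the double quotes inside the pattern need no escaping.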

Upvotes: 2
