Jason Carter

Reputation: 186

How to extract URL from html source with sed/awk or cut?

I am writing a script that downloads an HTML page's source to a file, then reads the file and extracts a specific URL that appears right after a specific piece of markup (it occurs only once).

Here is a sample that I need matched:

<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"

The code preceding the URL will always be the same so I need to extract the part between:

<img id="sample-image" class="photo" src="

and the " after the URL.

I tried something with sed like this:

sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt

But it does not work. I would appreciate your suggestions. Thanks a lot!

Upvotes: 2

Views: 5705

Answers (4)

doubleDown

Reputation: 8398

A few things about the sed command you are using:

sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
  • You don't need to escape the <, " or space. The single quotes prevent the shell from doing word splitting and other processing on your sed expression.

  • You are essentially doing sed -n '/pattern/p' test.txt (except that you seem to be missing the opening slash), which says "match this pattern, then print the line that contains the match". It does not actually extract the URL.

  • This is minor, but you don't need to match class="photo", since the id already makes the HTML element unique (no two elements share the same id within one HTML document).
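
To make the second point concrete, here is a minimal sketch of what /pattern/p does. The file contents, including the surrounding <div>, are made up for illustration and are not taken from the real page.

```shell
# Hypothetical one-line sample file (the <div> wrapper is an assumption).
tmp=$(mktemp)
printf '%s\n' '<div><img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"></div>' > "$tmp"

# /pattern/p selects the matching line and prints all of it, markup included.
whole_line=$(sed -n '/<img id="sample-image"/p' "$tmp")
printf '%s\n' "$whole_line"

rm -f "$tmp"
```

The output is the complete line rather than just the URL, which is why a substitution with a capture group is needed.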

Here's what I would do:

sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
  • The p flag tells sed to print the line on which the substitution (s) was performed; combined with -n, no other lines are printed.

  • \(pattern\) captures a subexpression, which can be accessed via \1, \2, etc. on the right-hand side of s///.

  • The .* at the start of the regex is there in case something else precedes the <img> element on the line (you did mention you are parsing an HTML file).
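
As a rough illustration of the capture-group substitution described above, the sketch below runs it against a small file. The file contents are invented for the example. It uses [^"]* rather than [^"]+, because in sed's basic regular expressions an unescaped + is a literal plus sign.

```shell
# Hypothetical multi-line HTML file; only the <img> line comes from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><body>
<p>unrelated line</p>
<div><img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"></div>
</body></html>
EOF

# -n suppresses default output; the p flag prints only lines where s/// matched.
# \( ... \) captures the characters between src=" and the next double quote.
url=$(sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' "$tmp")
printf '%s\n' "$url"

rm -f "$tmp"
```

Only the line containing the id is rewritten and printed, so the variable url ends up holding just the address.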

Upvotes: 1

jaypal singh

Reputation: 77085

If you have GNU grep then you can do something like:

grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt

If you wish to use awk then the following would work:

awk -F\" '{print $(NF-1)}' test.txt
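
One caveat, sketched under the assumption that the real file has more than one line: with " as the field separator, a line that contains no double quote has NF equal to 1, so $(NF-1) is $0 and the whole line is printed. Adding a pattern restricts the output to the line that carries the id. The file below is made up for illustration.

```shell
# Hypothetical file with lines before and after the <img> element.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><body>
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg">
</body></html>
EOF

# Split each line on double quotes; the pattern keeps only the line with the id,
# and $(NF-1) is the last quoted value on that line (here, the src URL).
url=$(awk -F'"' '/id="sample-image"/ {print $(NF-1)}' "$tmp")
printf '%s\n' "$url"

rm -f "$tmp"
```

This still relies on src being the last quoted attribute on the line, as it is in the sample from the question.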

Upvotes: 3

Gilles Quénot

Reputation: 184995

You can use grep like this:

grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt

or with sed:

sed -nr 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt

or with awk:

awk -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt

Upvotes: 3

mproffitt

Reputation: 2517

With sed:

echo "$string" | sed 's/.*<img.*src="\([^"]*\)".*/\1/'

Upvotes: 2
