Reputation: 186

How to extract URL from html source with sed/awk or cut?

I am writing a script that downloads an HTML page's source to a file, then reads the file and extracts a specific URL that follows a fixed piece of markup (it occurs only once).

Here is a sample that I need matched:

<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"

The code preceding the URL will always be the same so I need to extract the part between:

<img id="sample-image" class="photo" src="

and the " after the URL.

I tried something with sed like this:

sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt

But it does not work. I would appreciate your suggestions. Thanks a lot!

Upvotes: 2

Answers (4)

doubleDown

Reputation: 8398

A few things about the sed command you are using:

sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt

You don't need to escape the <, the " or the space. The single quotes prevent the shell from doing word splitting and other processing on your sed expression.
You are essentially doing sed -n '/pattern/p' test.txt (except you seem to be missing the opening slash), which says "print every line that matches this pattern". It does not extract the URL from the line.
This is minor, but you don't need to match class="photo", since the id already makes the HTML element unique (no two elements may share the same id within one HTML document).

Here's what I would do:

sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt

The p flag tells sed to print the line where the substitution (s) was performed; combined with -n, only those lines are printed.
\(pattern\) captures a subexpression, which can be accessed via \1, \2, etc. on the right side of s///.
The .* at the start of the regex is there in case something else precedes the <img> element on the line (you did mention you are parsing an HTML file).
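
To see the capture-and-print behaviour described above on something concrete, here is a minimal sketch. The three-line sample file and the surrounding <div> are made up for illustration; only the <img> markup comes from the question. It uses [^"]* inside the group because in sed's default (basic) regex mode a bare + is a literal plus sign.

```shell
# Hypothetical sample input; only the <img ...> markup is taken from the question.
tmp=$(mktemp)
printf '%s\n' \
  '<html><body>' \
  '<div><img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg" alt=""></div>' \
  '</body></html>' > "$tmp"

# -n suppresses sed's default output; the p flag prints only the line
# on which the substitution succeeded, already reduced to the capture \1.
url=$(sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' "$tmp")

echo "$url"
rm -f "$tmp"
```

The lines without the <img> element produce no output at all, which is why the result can be assigned straight to a variable.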

Upvotes: 1

jaypal singh

Reputation: 77085

If you have GNU grep then you can do something like:

grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt

If you wish to use awk then the following would work:

awk -F\" '{print $(NF-1)}' test.txt
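
A small sketch of why $(NF-1) picks out the URL, assuming the line ends right after the closing quote as in the question's sample: splitting on " leaves an empty last field, so the URL is the second-to-last one. The /id="sample-image"/ guard in the second command is an addition for illustration, so that other lines of the file print nothing.

```shell
# Sample line copied from the question.
line='<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"'

# Show every field produced by splitting on the double quote.
printf '%s\n' "$line" | awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }'

# Only act on the line that carries the id, then take the second-to-last field.
url=$(printf '%s\n' "$line" | awk -F'"' '/id="sample-image"/ { print $(NF-1) }')
echo "$url"
```

If more attributes follow src on the same line, the URL is no longer second to last, and a fixed field number or one of the regex-based approaches is a safer choice.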

Upvotes: 3

Gilles Quénot

Reputation: 184995

You can use grep like this:

grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt

or with sed:

sed -rn 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt

or with awk:

awk -F'"' '/<img[[:space:]]+id="sample-image"/{print $6}' test.txt

Upvotes: 3

mproffitt

Reputation: 2517

With sed:

echo "$string" | sed 's/.*<img.*src="\([^"]*\)".*/\1/'

Upvotes: 2
