Reputation: 186
I am writing a script that downloads an HTML page's source to a file, then reads the file and extracts a specific URL that follows a fixed piece of markup (it occurs only once).
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
Upvotes: 2
Views: 5705
Reputation: 8398
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevent the shell from doing word splitting and other processing on your sed expression.
You are essentially doing this: sed -n '/pattern/p' test.txt (except you seem to be missing the opening slash), which says "match this pattern, then print the line that contains the match"; you are not actually extracting the URL.
This is minor, but you don't need to match class="photo" since the id already makes the HTML element unique (no two elements share the same id within the same HTML document).
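To make the second point concrete, here is a small sketch that runs a /pattern/p command against a one-line sample. The sample variable and the temporary file are made up for this illustration; the point is only that the whole line comes back, not just the URL.

```shell
# Hypothetical one-line sample, written to a temporary file.
sample='<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"'
tmp=$(mktemp)
printf '%s\n' "$sample" > "$tmp"

# An address followed by p prints every matching line in full;
# nothing is extracted from the line.
printed=$(sed -n '/<img id="sample-image"/p' "$tmp")
printf '%s\n' "$printed"

rm -f "$tmp"
```

The output is the complete <img ...> line, which is why a substitution is needed to isolate the URL.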
Here's what I would do:
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
The p flag tells sed to print the line where the substitution (s) was performed.
\(pattern\) captures a subexpression, which can be accessed via \1, \2, etc. on the right side of s///.
The .* at the start of the regex is in case something else precedes the <img> element on the line (you did mention you are parsing an HTML file).
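As a rough end-to-end sketch of the approach described above, the snippet below builds a small made-up HTML file and pulls the URL out with a capture group. It sticks to basic regular expression syntax, so the quantifier is * rather than +, and it uses [^>]* instead of .* between the id and src so the match stays inside one tag; both are choices made for this illustration.

```shell
# Hypothetical three-line HTML file for the demonstration.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><body>
<div><img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"></div>
</body></html>
EOF

# -n suppresses the default output; the trailing p prints only lines
# where the substitution succeeded. \( \) captures the URL and \1
# replaces the whole line with it.
url=$(sed -n 's/.*<img id="sample-image"[^>]*src="\([^"]*\)".*/\1/p' "$tmp")
printf '%s\n' "$url"

rm -f "$tmp"
```

Lines without the <img id="sample-image"> element produce no output, so url ends up holding just the one address.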
Upvotes: 1
Reputation: 77085
If you have GNU grep, you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk, the following works as long as the URL is the last quoted value on the line:
awk -F\" '{print $(NF-1)}' test.txt
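To see why $(NF-1) picks out the URL, here is a short sketch that splits the question's sample line on double quotes and prints every field with its index. The sample variable is just the line from the question; real pages with more attributes after src would shift the numbering.

```shell
sample='<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"'

# Print each quote-separated field with its index to see where the URL lands.
printf '%s\n' "$sample" | awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }'

# The line ends with a quote, so the last field is empty and the URL
# sits in the field just before it.
url=$(printf '%s\n' "$sample" | awk -F'"' '{ print $(NF-1) }')
nf=$(printf '%s\n' "$sample" | awk -F'"' '{ print NF }')
printf '%s\n' "$url"
```

Without a pattern in front of the action, awk applies it to every input line, so on a full HTML file it helps to restrict it to the line containing the <img> element.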
Upvotes: 3
Reputation: 184995
You can use grep like this:
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed:
sed -rn 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt
or with awk:
awk -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt
Upvotes: 3