Reputation: 2945

Python 3.3.2 - Finding Image Sources in HTML

I need to locate and extract image sources from an html file. For example, it might contain:

<image class="logo" src="http://example.site/logo.jpg">

<img src="http://another.example/picture.png">

I am using Python, and I would rather not use any third-party packages, though I can use the re module. The program should:

sift through everything
seek out the img or image tags
find the src and get the attribute value (without the double quotes)

Is this possible, and if so, how can I do it? We can assume that I don't need to access the internet to do this (I have a file called website.html that contains all the html code).

EDIT: My current regular expressions are

r'<img[^>]*\ssrc="(.*?)"'

and

r'<image[^>]*\ssrc="(.*?)"'.

The main problem is that the expression will pick up anything starting with img or image. For example, if there was something saying <imagesomethingrandom src="website">, it would still count that as an image (as the word image is at the start) and it would add the source.

Thanks in advance.

Rob.

Upvotes: 3

Answers (4)

Marcelo Rodrigues

Reputation: 35

To find the images in an HTML document using Beautiful Soup:

from bs4 import BeautifulSoup

html = '<img src="http://another.example/picture.png">'

soup = BeautifulSoup(html, 'html.parser')
# Collect the src of every <img> tag that actually has one
url_picture = []
for img in soup.find_all('img'):
    if img.has_attr('src'):
        url_picture.append(img['src'])
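
Since the question rules out third-party packages, the same loop can probably be done with the standard library's html.parser instead. This is only a minimal sketch: the class name ImageSrcParser, the function name extract_image_sources and the sample markup are made up for illustration, and it assumes the HTML is already in a string (for example, read from website.html).

```python
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collects the src value of every <img> or <image> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("img", "image"):
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def extract_image_sources(html):
    """Return the src values found in the given HTML text (hypothetical helper)."""
    parser = ImageSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    sample = (
        '<image class="logo" src="http://example.site/logo.jpg">\n'
        '<img src="http://another.example/picture.png">\n'
        '<imagesomethingrandom src="website">\n'
    )
    for source in extract_image_sources(sample):
        print(source)
```

Because the parser compares the whole tag name, a tag such as imagesomethingrandom is skipped without any extra pattern work.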

Upvotes: 0

Ro Yo Mi

Reputation: 15010

Description

This expression will:

find all image and img tags which have a src attribute
ignore tags which are not image or img, like imagesomethingrandom
capture the value of the src attribute
correctly handle single, double or non quoted attribute values
avoid most of the tricky edge cases which seem to trip up regular expressions when matching HTML

<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>


Examples


Sample Text

Note the rather difficult edge cases in the first line

<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>

Python Code

#!/usr/bin/python
import re

string = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""

# re.I matches tag names case-insensitively; re.S lets "." span newlines inside a tag
for intCount, matchObj in enumerate(re.finditer(regex, string, re.M | re.I | re.S)):
    print(" ")
    print("[", intCount, "][ 0 ] : ", matchObj.group(0))
    print("[", intCount, "][ 1 ] : ", matchObj.group(1))
    print("[", intCount, "][ 2 ] : ", matchObj.group(2))

Capture Groups

Group 0 gets the entire image or img tag
Group 1 gets the quote character which surrounded the src attribute value, if there was one
Group 2 gets the src attribute value

[ 0 ][ 0 ] :  <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] :  "
[ 0 ][ 2 ] :  http://another.example/picture.png

[ 1 ][ 0 ] :  <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] :  "
[ 1 ][ 2 ] :  http://example.site/logo.jpg

[ 2 ][ 0 ] :  <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] :  "
[ 2 ][ 2 ] :  http://another.example/DoubleQuoted.png

[ 3 ][ 0 ] :  <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] :  '
[ 3 ][ 2 ] :  http://another.example/SingleQuoted.png

[ 4 ][ 0 ] :  <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :  
[ 4 ][ 2 ] :  http://another.example/NotQuoted.png
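
If only the src values are wanted, as the question asks, the expression can be wrapped in a small helper that returns group 2 of each match. This is a rough sketch rather than a tested library routine: the names IMG_SRC and find_image_sources are invented here, and the HTML is assumed to be available as a string.

```python
import re

# Same expression as above, compiled once.
# re.I: case-insensitive tag names; re.S: "." may cross newlines inside a tag.
IMG_SRC = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.I | re.S,
)


def find_image_sources(html):
    """Return the src value (capture group 2) of each matched tag."""
    return [match.group(2) for match in IMG_SRC.finditer(html)]


if __name__ == "__main__":
    sample = """<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src=http://another.example/NotQuoted.png>
"""
    for source in find_image_sources(sample):
        print(source)
```

To run it against the file from the question, read website.html into a string first and pass that string to the helper.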

Upvotes: 1

Brigand

Reputation: 86250

And an altered version

<ima?ge? # using conditional letters, we match both tags in one expression
\s+      # require at least one space, also includes newlines which are valid
         # prevents <imgbutnotreally> tags
[^>]*?   # skip any other attributes; non-greedy for performance
\bsrc="([^"]+) # at a word boundary, match src=" and capture the value up to the closing quote


<ima?ge?\s+[^>]*?\bsrc="([^"]+)
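
The commented layout above is close to what Python's re.VERBOSE flag accepts, so it can be used almost as written. A small sketch, assuming double-quoted src values only and with the name IMG_SRC_SIMPLE and the sample markup invented for illustration:

```python
import re

# re.VERBOSE ignores the whitespace and "#" comments inside the pattern.
IMG_SRC_SIMPLE = re.compile(
    r'''
    <ima?ge?        # "img" or "image"
    \s+             # at least one whitespace character after the tag name
    [^>]*?          # any other attributes, non-greedily
    \bsrc="([^"]+)  # capture the double-quoted src value
    ''',
    re.VERBOSE | re.IGNORECASE,
)

if __name__ == "__main__":
    sample = (
        '<image class="logo" src="http://example.site/logo.jpg">\n'
        '<img src="http://another.example/picture.png">\n'
        '<imagesomethingrandom src="website">\n'
    )
    print(IMG_SRC_SIMPLE.findall(sample))
```

This version is shorter, but it does not handle single-quoted or unquoted values, and a src= that appears inside another attribute's value can still fool it.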

Upvotes: 0

Joe P

Reputation: 1

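Try BeautifulSoup. Just write:

from bs4 import BeautifulSoup
# theHTMLtext is the HTML as a string, e.g. the contents of website.html
soup = BeautifulSoup(theHTMLtext, 'html.parser')
imagesElements = soup.find_all('img')
# Pull the src value out of each tag that has one
sources = [img['src'] for img in imagesElements if img.has_attr('src')]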

Upvotes: 0
