Rob Alsod

Reputation: 2945

Python 3.3.2 - Finding Image Sources in HTML

I need to locate and extract image sources from an html file. For example, it might contain:

<image class="logo" src="http://example.site/logo.jpg">

or

<img src="http://another.example/picture.png">

I want to do this in Python, without any third-party libraries, although I can use the re module. The program should:

Is this possible, and if so, how can I do it? We can assume that I don't need to access the internet to do this (I have a file called website.html that contains all the html code).

EDIT: My current regular expressions are

r'<img[^>]*\ssrc="(.*?)"'

and

r'<image[^>]*\ssrc="(.*?)"'

The main problem is that the expression will pick up anything starting with img or image. For example, if there was something saying <imagesomethingrandom src="website">, it would still count that as an image (as the word image is at the start) and it would add the source.
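The false positive described above can be reproduced in a few lines. This is only a sketch: the sample string is made up from the tags mentioned in the question, and the `strict` pattern is one assumed fix (a `\b` word boundary after the tag name), not necessarily the best one.

```python
import re

# The <image ...> pattern from the question: nothing forces the tag name
# to end after "image", so <imagesomethingrandom ...> is accepted too.
loose_image = re.compile(r'<image[^>]*\ssrc="(.*?)"')

# Assumed fix: \b requires the tag name to stop right after "img" or "image".
strict = re.compile(r'<(?:img|image)\b[^>]*\ssrc="(.*?)"')

sample = (
    '<imagesomethingrandom src="website">\n'
    '<image class="logo" src="http://example.site/logo.jpg">\n'
    '<img src="http://another.example/picture.png">\n'
)

print(loose_image.findall(sample))  # includes the unwanted "website"
print(strict.findall(sample))       # only the two real image sources
```

Note that `\b` still lets a hyphenated name such as `<img-x ...>` through, because a hyphen is not a word character.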

Thanks in advance.

Rob.

Upvotes: 3

Views: 2346

Answers (4)

Marcelo Rodrigues

Reputation: 35

To find the images in an HTML document using Beautiful Soup (a third-party package):

from bs4 import BeautifulSoup

# Sample markup; in practice this would be the contents of the HTML file
html = '<img src="http://another.example/picture.png">'

soup = BeautifulSoup(html, 'html.parser')
# Collect the src attribute of every <img> tag that has one
url_picture = [img['src'] for img in soup.find_all('img', src=True)]

Upvotes: 0

Ro Yo Mi

Reputation: 15010

Description

This expression will:

  • find all image and img tags which have a src attribute
  • ignore tags which are not image or img, like imagesomethingrandom
  • capture the value of the src attribute
  • correctly handle single, double or non quoted attribute values
  • avoid most of the tricky edge cases which seem to trip up regular expressions when matching HTML

<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>

Examples

Sample Text

Note the rather difficult edge cases in the first line

<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>

Python Code

#!/usr/bin/env python3
import re

string = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""

# For each match, print the whole tag (group 0), the quote character (group 1)
# and the src value (group 2)
for intCount, matchObj in enumerate(re.finditer(regex, string, re.M | re.I | re.S)):
    print(" ")
    print("[", intCount, "][ 0 ] : ", matchObj.group(0))
    print("[", intCount, "][ 1 ] : ", matchObj.group(1))
    print("[", intCount, "][ 2 ] : ", matchObj.group(2))

Capture Groups

Group 0 gets the entire image or img tag
Group 1 gets the quote character surrounding the src attribute value, if there is one
Group 2 gets the src attribute value

[ 0 ][ 0 ] :  <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] :  "
[ 0 ][ 2 ] :  http://another.example/picture.png

[ 1 ][ 0 ] :  <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] :  "
[ 1 ][ 2 ] :  http://example.site/logo.jpg

[ 2 ][ 0 ] :  <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] :  "
[ 2 ][ 2 ] :  http://another.example/DoubleQuoted.png

[ 3 ][ 0 ] :  <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] :  '
[ 3 ][ 2 ] :  http://another.example/SingleQuoted.png

[ 4 ][ 0 ] :  <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :  
[ 4 ][ 2 ] :  http://another.example/NotQuoted.png
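
If only the list of sources is needed, the same expression can be wrapped in a small helper. This is a minimal sketch: the names `IMG_SRC` and `image_sources` are made up for illustration, and `re.M` is left out on the assumption that it makes no difference here because the pattern has no `^` or `$` anchors.

```python
import re

# The expression from this answer, compiled once and reused.
IMG_SRC = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.I | re.S,
)


def image_sources(html_text):
    """Return the src value (capture group 2) of every matching tag, in order."""
    return [match.group(2) for match in IMG_SRC.finditer(html_text)]
```

For the file mentioned in the question, something like `image_sources(open("website.html", encoding="utf-8").read())` should do; the UTF-8 encoding is an assumption about that file.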

Upvotes: 1

Brigand

Reputation: 86250

And an altered version

<ima?ge? # using optional letters, we match both tags in one expression
\s+      # require at least one whitespace character (newlines are valid too);
         # prevents <imgbutnotreally> tags
[^>]*?   # skip any other attributes without leaving the tag; non-greedy (performance)
\bsrc="([^"]+) # match a word boundary, then capture the value of the src attribute

As a single line:

<ima?ge?\s+[^>]*?\bsrc="([^"]+)
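
In Python the commented form can be used directly with `re.VERBOSE`, which ignores whitespace and `#` comments inside the pattern. A minimal sketch, assuming a word boundary before `src` is what was intended; the name `IMG_SRC` and the sample strings are made up for illustration.

```python
import re

IMG_SRC = re.compile(
    r"""
    <ima?ge?        # optional letters let one pattern cover both tag names
    \s+             # at least one whitespace character (newlines included)
    [^>]*?          # any other attributes, lazily, without leaving the tag
    \bsrc="([^"]+)  # word boundary, then capture the double-quoted src value
    """,
    re.VERBOSE | re.IGNORECASE,
)

print(IMG_SRC.findall('<img src="a.png">'))
print(IMG_SRC.findall('<imgbutnotreally src="x.png">'))
print(IMG_SRC.findall('<image\n class="logo" src="http://example.site/logo.jpg">'))
```

Two limits worth knowing: only double-quoted values are captured, and because `-` is not a word character, `\bsrc` also matches the tail of an attribute such as `data-src`.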

Upvotes: 0

Joe P

Reputation: 1

Try BeautifulSoup. Just write:

from bs4 import BeautifulSoup
# theHTMLtext is the HTML source as a string
soup = BeautifulSoup(theHTMLtext, 'html.parser')
imagesElements = soup.find_all('img')
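
Since the question rules out third-party packages, the standard library's `html.parser.HTMLParser` is a possible alternative to both BeautifulSoup and regular expressions. What follows is a minimal sketch rather than a drop-in solution; `ImageSourceParser` and `image_sources` are hypothetical names.

```python
from html.parser import HTMLParser


class ImageSourceParser(HTMLParser):
    """Collect src values from <img> and <image> start tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) pairs
        if tag in ("img", "image"):
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(html_text):
    """Parse html_text and return the image sources in document order."""
    parser = ImageSourceParser()
    parser.feed(html_text)
    parser.close()
    return parser.sources
```

Because the parser compares the whole tag name, a tag like `<imagesomethingrandom>` is skipped without any special handling.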

Upvotes: 0
