Reputation: 79
I've got to find the images in an HTML source. I'm using regex instead of html.parser because I know it better, but if you can explain HTML parsing to me like you would to a child, I'll happily go down that road too.
Can't use beautifulsoup, wish I could, but I got to learn to do this the hard way.
I've read through a lot of questions and answers on here on regex and html (example) so I'm aware of the feelings on this topic.
But hear me out!
Here's my coding attempt (Python 3):
import urllib.request
import re
website = urllib.request.urlopen('http://google.com')
html = website.read()
pat = re.compile (r'<img [^>]*src="([^"]+)')
img = pat.findall(html)
I double-checked my regex on regex101.com and it works at finding the img link, but when I run it in IDLE, I get a syntax error and it keeps highlighting the caret. Why?
I'm headed in the right direction... yes?
Update: Hi, I was thinking maybe I'd get a short, quick answer, but it seems I may have touched a nerve in the community.
I am definitely new and terrible at programming, no way around that. I've been reading all the comments and I really appreciate all the help and patience users have shown me.
Upvotes: 4
Views: 5845
Reputation: 82490
Instead of using urllib, I used requests, which you can download from here. They do the same thing; I just like requests better since it has a nicer API. The regex string is only slightly changed: \s* is added in case there is whitespace between the < and img. You were headed in the right direction. You can find out more about the re module here.
Here is the code
import requests
import re
website = requests.get('http://stackoverflow.com/')
html = website.text
pat = re.compile(r'<\s*img [^>]*src="([^"]+)')
img = pat.findall(html)
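print(img)  # parenthesized form runs on both Python 2 and Python 3
And the output (from a Python 2 run, hence the u'' prefixes):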
[u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/L8rHf.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/Ryr18.png', u'https://i.sstatic.net/ASf0H.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/Ryr18.png', u'https://i.sstatic.net/VgvXl.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/tKsDb.png', u'https://i.sstatic.net/6QN0y.png', u'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']
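To see what the added \s* actually changes without hitting the network, here is a small offline check in Python 3. The sample HTML string and the variable names are made up for illustration, and whether real pages contain a space after the < is a separate question.

```python
import re

# Made-up HTML snippet for illustration; not taken from any real page.
sample = '<img src="a.png"> < img class="x" src="b.png"> <p>no image</p>'

# Pattern from the question: "<" must be followed directly by "img".
strict = re.compile(r'<img [^>]*src="([^"]+)')
# Pattern from this answer: optional whitespace allowed after "<".
lenient = re.compile(r'<\s*img [^>]*src="([^"]+)')

strict_hits = strict.findall(sample)
lenient_hits = lenient.findall(sample)

print(strict_hits)
print(lenient_hits)
```

Only the lenient pattern picks up the second tag, which has a space after the opening bracket.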
Upvotes: 1
Reputation: 33046
There is nothing wrong with the regex; you are missing two things:
1. Use a raw string so that the pattern is passed as-is to the regex compiler, without any escape interpretation.
2. The result of the .read() call is a byte sequence, not a string, so you need a byte-sequence regex.
The second one is Python 3-specific (and I see that you are using Python 3).
Putting it all together, just fix the aforementioned line like this:
pat = re.compile (rb'<img [^>]*src="([^"]+)')
r stands for raw and b for byte sequence.
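The str/bytes mismatch is easy to reproduce offline. The sketch below uses a made-up bytes literal as a stand-in for website.read(). It shows the TypeError a str pattern raises on bytes, the bytes pattern from above, and decoding first as an alternative. UTF-8 is an assumption here; a real page may declare a different charset.

```python
import re

# Stand-in for website.read(); made-up data, not a real response.
html = b'<html><img alt="logo" src="logo.png"></html>'

# A str pattern cannot be used on a bytes object in Python 3.
try:
    re.findall(r'<img [^>]*src="([^"]+)', html)
    error = None
except TypeError as exc:
    error = exc

# Option 1: bytes pattern, which yields bytes results.
bytes_hits = re.findall(rb'<img [^>]*src="([^"]+)', html)

# Option 2: decode the bytes first (UTF-8 assumed), then use a str pattern.
str_hits = re.findall(r'<img [^>]*src="([^"]+)', html.decode('utf-8'))

print(type(error).__name__)
print(bytes_hits)
print(str_hits)
```

Decoding first is handy if you want plain strings back instead of b'...' values.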
Additionally, test on a website that actually embeds images in <img> tags, like http://stackoverflow.com. You will not find anything when processing http://google.com.
Here we go:
Python 3.3.2+
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import re
>>> website = urllib.request.urlopen('http://stackoverflow.com/')
>>> html = website.read()
>>> pat = re.compile (rb'<img [^>]*src="([^"]+)')
>>> img = pat.findall(html)
>>> img
[b'https://i.sstatic.net/tKsDb.png', b'https://i.sstatic.net/dmHl0.png', b'https://i.sstatic.net/dmHl0.png', b'https://i.sstatic.net/tKsDb.png', b'https://i.sstatic.net/6QN0y.png', b'https://i.sstatic.net/tKsDb.png', b'https://i.sstatic.net/L8rHf.png', b'https://i.sstatic.net/tKsDb.png', b'http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif']
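Since the question also asks about html.parser, here is a minimal sketch of that route using only the standard library. The class name ImgSrcCollector and the sample string are made up for illustration. The idea is that the parser calls handle_starttag for every opening tag it meets, and you just note down the src of each img. Note that feed() expects a str, so bytes from website.read() would need decoding first.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


# Made-up sample; mixed case and single quotes are handled by the parser.
sample = '<p><img src="a.png"><IMG alt="x" SRC=\'b.png\' /></p>'

parser = ImgSrcCollector()
parser.feed(sample)
print(parser.sources)
```

Unlike the regex, this does not care about attribute order, quote style, or upper-case tag names.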
Upvotes: 3
Reputation: 3020
re.compile (r'<img [^>]*src="([^"]+)')
You are missing the quotation marks (single or double) around the pattern.
Upvotes: 0