dot.Py

Reputation: 5157

How can I extract a specific img src url format using regex?

My string:

Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|

I want to get these three links into a list:

http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8

They obey this pattern:

src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"

I know that I should use re.findall(pattern, string) to achieve that.

But the big question is: How can I build a pattern that works here?

I'm not that good at writing regex patterns and I always get confused. The one that almost got the job done was this:

pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

But all I got was this list:

[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']

It looks like the problem is with the ~r part and the stuff after that.

Upvotes: 0

Views: 119

Answers (4)

NikT

Reputation: 1990

Where is this data coming from? If it comes from HTML, I'd suggest using an HTML parser instead of trying to extract the URLs with a regex. A parser can pull the full attribute values out of the tags.

Below, I put your string in test.html and used BeautifulSoup in Python as an example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'), 'html.parser')
>>> [x['src'] for x in soup.find_all('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']
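If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library alone. This is not the approach shown above but a stand-in for it, assuming Python 3; the class name ImgSrcCollector and the shortened text string are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        # Self-closing tags such as <img ... /> are routed here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Shortened stand-in for the string in the question (an assumption).
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

collector = ImgSrcCollector()
collector.feed(text)
collector.close()
print(collector.sources)
```

Because the parser reads the attribute value itself, the result does not depend on which characters the URL contains.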

Upvotes: 2

Shekhar Khairnar

Reputation: 2691

Try this:

(?:src=)(".*?")

and take group 1. As written, the group includes the surrounding double quotes.
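In Python, re.findall returns the contents of the group directly when the pattern has exactly one capturing group. Here is a minimal sketch of that, with the quotes moved outside the parentheses so that only the URL is captured. The names src_pattern and links and the shortened text string are assumptions for illustration.

```python
import re

# Shortened stand-in for the string in the question (an assumption).
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# The quotes sit outside the parentheses, so the group holds only the URL.
# The lazy .*? stops at the first closing quote after src=".
src_pattern = re.compile(r'src="(.*?)"')

# With a single capturing group, findall returns a list of group contents.
links = src_pattern.findall(text)
print(links)
```

Note that this matches any src="..." attribute in the text, not only those inside img tags.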

Upvotes: 0

khelili miliana

Reputation: 3822

Try this script:

text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
import re
# \S+ would also swallow the closing double quote, so exclude it explicitly.
print(re.findall(r'https?://[^\s"]+', text1))

The result is:

['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']

Upvotes: 0

Ward

Reputation: 2852

You are missing the ~ character in your regex:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
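The reason only ~ was missing is that $-_ inside the brackets is a character range running from $ to _, which already covers /, -, digits and uppercase letters. The match therefore stopped at the first ~ and, with ~ added, it stops at the closing double quote, which falls outside every class. A small sketch comparing the two patterns, with original_pattern, fixed_pattern and the shortened text string as illustrative names:

```python
import re

# Shortened stand-in for the string in the question (an assumption).
text = (
    'First item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|'
    'Second item.<img alt="" height="1" '
    'src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|'
)

# Pattern from the question: no class contains "~".
original_pattern = (
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]'
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Same pattern with "~" added to the third character class.
fixed_pattern = (
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+~]|[!*\(\),]'
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# All groups are non-capturing, so findall returns the full matches.
truncated = re.findall(original_pattern, text)
links = re.findall(fixed_pattern, text)

print(truncated)
print(links)
```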

btw: this is a super way to test regex in Python!

Upvotes: 0
