Reputation: 5157
My string:
Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|
I wanna get these 3 links into a list:
http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw
http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0
http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8
They obey this pattern:
src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"
I know that I should use re.findall(pattern, string)
to achieve that.
But the big question is: How can I build a pattern that works here?
I'm not that good at writing regex patterns.. I always get confused... the one that almost got the job done was this one:
pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
But all I got was this list:
[u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/', u'http://feeds.feedburner.com/']
It looks like the problem is with the ~r
part and the stuff after that.
Upvotes: 0
Views: 119
Reputation: 1990
where is this data coming from ? I'd suggest using an html parser instead of trying to extract with regex. you can pull out the full values from within the tags there if that's coming from html
below i put your string in test.html (for python using beautifulsoup as example)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open(r'A:\test.html'))
>>> [x['src'] for x in soup.findAll('img')]
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8']
Upvotes: 2
Reputation: 3822
try this script :
text1="""Russia's National Settlement Depository discusses why it believes the biggest blockchain opportunities have yet to be uncovered.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw" width="1" />|One of the co-founder of digital currency startup Stellar announced their resignation today.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0" width="1" />|The editorial board for Bloomberg News has called for a permissive regulatory environment for blockchain development.<img alt="" height="1" src="http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8" width="1" />|"""
import re
print re.findall(r'(https?://\S+)', text1)
and the result is
['http://feeds.feedburner.com/~r/CoinDesk/~4/rvoQUj-KDaw"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/xRzN7syt-v0"', 'http://feeds.feedburner.com/~r/CoinDesk/~4/ooQYB2iDxP8"']
Upvotes: 0