Reputation: 1128
I'm new to regex and this is stumping me.
In the following example, I want to extract `facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info`. I've read up on lazy quantifiers and lookbehinds, but I still can't piece together the right regex. I'd expect `facebook.com\/.*?sk=info` to work, but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress"><a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?sk=page_map" target="_self">7508 15th Avenue, Brooklyn, New York 11228</a></span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
Upvotes: 1
Views: 434
Reputation: 4628
The problem is that there is another `facebook.com` earlier in the string. You can restrict the `.*` so that it cannot match `"`, which forces the match to stay within a single attribute:
facebook\.com\/[^"]*sk=info
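A quick way to check the quote-restricted idea, as a minimal Python sketch. The `html` string is a trimmed-down stand-in for the markup in the question (an assumption, not the full page), and the variable names are only illustrative. Note that a pattern spelled with `;sk=info` only matches when the ampersand is escaped as `&amp;` in the raw markup, while the version without the `;` handles both forms.

```python
import re

# Trimmed-down stand-in for the HTML in the question: two links on the same host.
html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&sk=info" aria-label="About">'
)

# [^"]* cannot cross a double quote, so a match has to stay inside one attribute value.
pattern = re.compile(r'facebook\.com/[^"]*sk=info')
match = pattern.search(html)
result = match.group(0) if match else None
print(result)

# Same markup with the ampersand escaped, as it would appear in raw page source.
escaped = html.replace("&sk=", "&amp;sk=")
with_semicolon = re.compile(r'facebook\.com/[^"]*;sk=info')
print(with_semicolon.search(html))                # None: no ';' before sk=info here
print(with_semicolon.search(escaped) is not None)  # True: '&amp;sk=info' is present
print(pattern.search(escaped) is not None)         # True: works for both forms
```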
Upvotes: 2
Reputation: 89604
Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries the pattern from left to right in the string.
When the engine reaches the first `facebook.com\/` in the string, the `.*?` that follows lets it add characters to the candidate match one at a time (including `"`, `>`, and spaces) until it finds `sk=info`, since `.` matches any character except a newline. A lazy quantifier only makes the match end as early as possible; it does not move the starting point, so the match still begins at the first `facebook.com`.
This is why fejese suggests replacing the dot with `[^"]` and aliteralmind suggests `[^>]`: either class makes the pattern fail at that first position, so the engine moves on and tries again further along the string.
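The behaviour described above can be observed directly. This is a small Python sketch, assuming a trimmed-down stand-in for the question's HTML (the `html` string and the variable names are illustrative, not from the original posts):

```python
import re

# Trimmed-down stand-in for the HTML in the question: two links on the same host.
html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&sk=info" aria-label="About">'
)

first = html.index("facebook.com")               # position of the first occurrence
second = html.index("facebook.com", first + 1)   # position of the wanted occurrence

# The lazy version from the question: leftmost start wins, laziness only affects the end.
lazy = re.search(r"facebook\.com/.*?sk=info", html)

starts_at_first = lazy.start() == first
crosses_tags = 'target="_self"' in lazy.group(0)
print(starts_at_first, crosses_tags)
```

Both values come out `True` here: the match starts at the first `facebook.com` and swallows the attribute and tag boundaries in between.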
Using an HTML parser is the easiest way to deal with HTML. However, for a one-off match or search/replace, keep in mind that although a parser gives you safety and simplicity, it has a performance cost, since the whole tree of your document has to be loaded for a single task.
Upvotes: 2
Reputation: 78011
As much as I love regex, this is an HTML parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
>>> # use .get() so <a> tags without an href don't raise KeyError
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')
>>> [tag['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
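If installing BeautifulSoup isn't an option, roughly the same idea can be sketched with the standard library's `html.parser` module instead. This is a swapped-in technique, not what the answer above uses; the `HrefCollector` class and the trimmed-down `html` string are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags that end with a given suffix."""

    def __init__(self, suffix):
        super().__init__()
        self.suffix = suffix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(self.suffix):
                self.hrefs.append(value)


# Trimmed-down stand-in for the HTML in the question.
html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&sk=info" aria-label="About"></a>'
)

collector = HrefCollector("sk=info")
collector.feed(html)
collector.close()
print(collector.hrefs)
```

Unlike the regex answers, this returns the full `href` value including the scheme, because it works on parsed attributes rather than raw text.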
Upvotes: 4
Reputation: 20163
This works :)
facebook\.com\/[^>]*?sk=info
With only `.*`, it finds the first `facebook.com` and then continues until the `sk=info`. Since there's another `facebook.com` in between, the match spans both links.
The one thing in between that you don't want is a `>` (or a `<`, among other characters), so changing "anything" to "anything but a `>`" finds the `facebook.com` closest to the `sk=info`, as you want.
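Here is a small Python check of the `[^>]` version, assuming a trimmed-down stand-in for the question's HTML (the `html` string and variable names are illustrative). It confirms that the match now starts at the second `facebook.com`, the one nearest to `sk=info`:

```python
import re

# Trimmed-down stand-in for the HTML in the question: two links on the same host.
html = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?id=176363502456825&sk=info" aria-label="About">'
)

first = html.index("facebook.com")
second = html.index("facebook.com", first + 1)

# [^>]*? cannot step over the '>' that closes the first <a ...> tag, so the attempt
# starting at the first facebook.com fails and the engine retries further along.
match = re.search(r"facebook\.com/[^>]*?sk=info", html)
found = match.group(0) if match else None
print(found)
print(match.start() == second)
```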
And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
Upvotes: 3