Reputation: 12915
I'm trying to match the following video url:
<iframe width="420" height="315" src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" frameborder="0" allowfullscreen></iframe>
I have the following:
^<iframe
(\swidth="\d{1,3}")?
(\sheight="\d{1,3}")?
(\salt=""[^""<>]*"")?
(\stitle=""[^""<>]*"")?
\ssrc="//(www.youtube.com|player.vimeo.com)/[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+"
(\sframeborder="[^""<>]*")?
(\sallowfullscreen)?
\s?/?></iframe>$
This works, but I can't rely on YouTube always providing embed links that follow this structure. If they move the width attribute to after src, my regex will fail.
Is there any way to do order-agnostic groupings, to address this?
Upvotes: 2
Views: 431
Reputation: 46375
You can make each of the search terms a lookahead. Lookaheads don't consume any of the string, so the terms can appear in any order. Example:
<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*
will match both
<iframe width="123" height="321"
and
<iframe height="321" width="123"
I am sure you can finish this yourself (adding all the terms you want to match).
Note: this only "matches"; it does not "extract". But it will tell you that all these terms are present in the string, in any order.
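As a quick check of the lookahead idea, here is a minimal Python sketch using the exact pattern from above. The function name `has_width_and_height` is hypothetical, chosen only for this illustration; the same pattern should behave the same way in most regex flavours that support lookaheads.

```python
import re

# Pattern copied from the answer: two lookaheads, each asserting that an
# attribute appears somewhere after "<iframe ", without consuming input.
ORDER_AGNOSTIC = re.compile(r'<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*')


def has_width_and_height(tag: str) -> bool:
    """Return True if the tag contains both attributes, in either order."""
    return ORDER_AGNOSTIC.match(tag) is not None


print(has_width_and_height('<iframe width="123" height="321"'))
print(has_width_and_height('<iframe height="321" width="123"'))
print(has_width_and_height('<iframe width="123"'))
```

The first two calls should succeed and the third should fail, because the height lookahead finds nothing to match.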
EDIT: since I started writing this answer, a number of comments have appeared that change my understanding of your request. If you "just" want to extract the src value, you simply do
<iframe.*?src="([^"]+)"
and the first capture group (the part in parentheses) will be whatever is between the double quotes that follow src=. Typically there are better tools than regex for parsing HTML - my personal preference is BeautifulSoup (Python).
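For illustration, here is a rough Python sketch of both approaches, run against the iframe from the question. To keep it dependency-free, it uses the standard library's `html.parser` rather than BeautifulSoup, so treat the parser half as a stand-in for the same idea. The names `src_via_regex`, `src_via_parser` and `IframeSrcParser` are made up for this example.

```python
import re
from html.parser import HTMLParser

# Regex from the answer: capture whatever sits between the quotes after src=.
SRC_RE = re.compile(r'<iframe.*?src="([^"]+)"')


def src_via_regex(html):
    """Return the src of the first iframe found, or None."""
    match = SRC_RE.search(html)
    return match.group(1) if match else None


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def src_via_parser(html):
    """Return the src of every iframe in the markup, in document order."""
    parser = IframeSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


embed = ('<iframe width="420" height="315" '
         'src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" '
         'frameborder="0" allowfullscreen></iframe>')

print(src_via_regex(embed))
print(src_via_parser(embed))
```

Once the src is extracted, checking the host against youtube.com or vimeo.com becomes a separate, much simpler test.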
Upvotes: 2