RobVious

Reputation: 12915

Order-agnostic regex - is it possible?

I'm trying to match the following video url:

<iframe width="420" height="315" src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" frameborder="0" allowfullscreen></iframe>

I have the following:

^<iframe
(\swidth="\d{1,3}")?
(\sheight="\d{1,3}")?
(\salt="[^"<>]*")?
(\stitle="[^"<>]*")?
\ssrc="//(www.youtube.com|player.vimeo.com)/[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+"
(\sframeborder="[^"<>]*")?
(\sallowfullscreen)?
\s?/?></iframe>$

This is working, but I can't rely on YouTube always providing embed links that follow this structure. If they move the width attribute to after src, my regex will fail.

Is there any way to do order-agnostic groupings, to address this?

Upvotes: 2

Views: 431

Answers (1)

Floris

Reputation: 46375

You can make each of the search terms a lookahead. Lookaheads are zero-width: they check that a pattern occurs further along without consuming any of the string, so the terms can appear in any order. Example:

<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*

will match both

<iframe width="123" height="321"

and

<iframe height="321" width="123"

demo on regex101.com

I am sure you can finish this yourself (adding all the terms you want to match).

Note: this only "matches"; it does not "extract". It tells you that all of these terms are present in the tag, in any order.
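
As a rough illustration of the lookahead idea, here is a minimal Python sketch using the standard `re` module. The helper name `has_both_dimensions` is made up for this example, and it only checks width and height, as in the pattern above; the remaining attributes would need their own lookaheads.

```python
import re

# Each lookahead scans ahead from just after "<iframe " for one attribute.
# Lookaheads consume nothing, so both start from the same position and the
# attribute order does not matter.
PATTERN = re.compile(r'<iframe (?=.*height="\d{1,3}")(?=.*width="\d{1,3}").*')


def has_both_dimensions(tag: str) -> bool:
    """Hypothetical helper: True if the tag has both width and height."""
    return PATTERN.match(tag) is not None


print(has_both_dimensions('<iframe width="123" height="321"'))
print(has_both_dimensions('<iframe height="321" width="123"'))
print(has_both_dimensions('<iframe width="123"'))
```

The first two calls should report a match and the third should not, since the height attribute is missing there.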

EDIT: Since I started writing this answer, a number of comments have appeared that change my understanding of your request. If you "just" want to extract the src= value, you can simply use

<iframe.*?src="([^"]+)"

and the first capture group (the part in parentheses) will be whatever sits between the double quotes that follow src=. Typically there are better tools than regex for parsing HTML; my personal preference is BeautifulSoup (Python).
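
For comparison, here is a small Python sketch of both routes. The regex part uses the pattern above as-is. The parser part uses the standard library's `html.parser` instead of BeautifulSoup, purely so that it runs without installing anything; the class and function names (`IframeSrcParser`, `src_via_regex`, `src_via_parser`) are invented for this example.

```python
import re
from html.parser import HTMLParser

EMBED = ('<iframe width="420" height="315" '
         'src="//www.youtube.com/embed/F40ZBDAG8-o?rel=0" '
         'frameborder="0" allowfullscreen></iframe>')

# Regex route: lazily skip ahead to src=" and capture up to the closing quote.
SRC_RE = re.compile(r'<iframe.*?src="([^"]+)"')


def src_via_regex(html):
    """Return the first iframe src found by the regex, or None."""
    m = SRC_RE.search(html)
    return m.group(1) if m else None


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; order is irrelevant here.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def src_via_parser(html):
    """Return a list of src values from all iframes in the HTML."""
    parser = IframeSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


print(src_via_regex(EMBED))
print(src_via_parser(EMBED))
```

Neither route depends on where src sits among the other attributes, so a reordered embed code should give the same result.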

Upvotes: 2
