user784637

Reputation: 16142

Translating a Python Conditional to JavaScript Regex

I'm trying to convert this Python regex to a JavaScript regex:

https://github.com/rg3/youtube-dl/blob/a14e1538fe66c49ca8869681d2bbe60a36bd420d/youtube_dl/extractor/youtube.py#L134-L159

r"""(?x)^
(
    (?:https?://|//)?                                    # http(s):// or protocol-independent URL (optional)
    (?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|
    (?:www\.)?deturl\.com/www\.youtube\.com/|
    (?:www\.)?pwnyoutube\.com/|
    (?:www\.)?yourepeat\.com/|
    tube\.majestyc\.net/|
    youtube\.googleapis\.com/)                        # the various hostnames, with wildcard subdomains
    (?:.*?\#/)?                                          # handle anchor (#/) redirect urls
    (?:                                                  # the various things that can precede the ID:
        (?:(?:v|embed|e)/)                               # v/ or embed/ or e/
        |(?:                                             # or the v= param in all its forms
            (?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?  # preceding watch(_popup|.php) or nothing (like /?v=xxxx)
            (?:\?|\#!?)                                  # the params delimiter ? or # or #!
            (?:.*?&)?                                    # any other preceding param (like /?s=tuff&v=xxxx)
            v=
        )
    ))
    |youtu\.be/                                          # just youtu.be/xxxx
    |https?://(?:www\.)?cleanvideosearch\.com/media/action/yt/watch\?videoId=
    )
)?                                                       # all until now is optional -> you can pass the naked ID
([0-9A-Za-z_-]{11})                                      # here is it! the YouTube video ID
(?(1).+)?                                                # if we found the ID, everything can follow
$"""

I removed the quotes at the start and end, added the opening / and closing /g delimiters, escaped the forward slashes, removed the free-spacing mode, and ended up with this:

var VALID_URL = /^((?:https?:\/\/|\/\/)?(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com\/|(?:www\.)?deturl\.com\/www\.youtube\.com\/|(?:www\.)?pwnyoutube\.com\/|(?:www\.)?yourepeat\.com\/|tube\.majestyc\.net\/|youtube\.googleapis\.com\/)(?:.*?\#\/)?(?:(?:(?:v|embed|e)\/)|(?:(?:(?:watch|movie)(?:_popup)?(?:\.php)?\/?)?(?:\?|\#!?)(?:.*?&)?v=)))|youtu\.be\/|https?:\/\/(?:www\.)?cleanvideosearch\.com\/media\/action\/yt\/watch\?videoId=))?([0-9A-Za-z_-]{11})(?(1).+)?$/g;

However, the JavaScript regex debugger I'm using reports Unexpected character "(" after "?" for the JavaScript version of this part of the Python regex:

(?(1).+)?      # if we found the ID, everything can follow

Any idea how I can resolve this error?

Upvotes: 0

Views: 107

Answers (1)

zx81

Reputation: 41838

JavaScript regex does not support conditionals such as (?(1).+), which is why the debugger rejects the "(" after the "?".

But the world of regex has long survived without conditionals, and there are ways around it.
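A quick way to confirm this is to hand both forms to the RegExp constructor. This is a minimal sketch with a toy pattern; the helper name compiles is hypothetical and only exists for this illustration.

```javascript
// Returns true when `source` is accepted as a JavaScript regex pattern.
// `compiles` is a hypothetical helper name used only for this sketch.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    // An unsupported construct surfaces as a SyntaxError.
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Python-style conditional: rejected by the JavaScript regex engine.
console.log(compiles("(a)?b(?(1)c)")); // false

// The same logic written as an alternation: accepted.
console.log(compiles("(?:(a)bc|b)")); // true
```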

The Idea

The basic structure of that scary regex was this:

(Capture A)? (Match B) ( If A was captured, (Match C)? )

You can translate the IF into an OR:

(Capture A) (Match B) (Match C)? **OR** (Match B)
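To make the rewrite concrete, here is a small sketch on a toy pattern rather than the real one. The stand-ins are assumptions chosen for brevity: "yt/" plays Capture A, \d{3} plays Match B, and .+ plays Match C. The names toy and toyId are hypothetical.

```javascript
// Python original (not valid in JavaScript): ^(yt/)?(\d{3})(?(1).+)?$
// Left branch:  A, then B, then optional C.
// Right branch: B alone, so nothing may follow it.
const toy = /^(?:(yt\/)(\d{3})(?:.+)?|(\d{3}))$/;

// Returns the B part of the input, or null when the input does not match.
function toyId(input) {
  const m = toy.exec(input);
  if (m === null) return null;
  // B lands in group 2 on the left branch and in group 3 on the right branch.
  return m[2] !== undefined ? m[2] : m[3];
}

console.log(toyId("yt/123?x=1")); // 123
console.log(toyId("123"));        // 123
console.log(toyId("123?x=1"));    // null
```

The last call shows the conditional's effect surviving the rewrite: without the prefix, trailing text is not allowed.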

Converted Regex

Try this (both branches end in $, as in the original; if you write it as a regex literal, escape each / as \/):

^((?:https?://|//)?(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|(?:www\.)?deturl\.com/www\.youtube\.com/|(?:www\.)?pwnyoutube\.com/|(?:www\.)?yourepeat\.com/|tube\.majestyc\.net/|youtube\.googleapis\.com/)(?:[^\n]*?#/)?(?:(?:(?:v|embed|e)/)|(?:(?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?(?:\?|#!?)(?:[^\n]*?&)?v=)))|youtu\.be/|https?://(?:www\.)?cleanvideosearch\.com/media/action/yt/watch\?videoId=)([0-9A-Za-z_-]{11})(?:[^\n]+)?)$|^([0-9A-Za-z_-]{11})$

Explanation

The (?(1).+)? conditional optionally matches .+ if Group 1 is set. In the converted regex I wrote . as [^\n], which mirrors Python's dot (anything except a newline). Since the conditional occurs after the non-optional ([0-9A-Za-z_-]{11}), I transformed it into an alternation (|):

  • I make no judgment about the suitability of the regex... I rearranged the "grammar" without looking at the "words". :)
  • Either we match that whole Group 1, into which we now directly roll the ([0-9A-Za-z_-]{11}) and the optional component, OR
  • We directly match the ([0-9A-Za-z_-]{11})
  • If you are interested in retrieving the ([0-9A-Za-z_-]{11}), it will live in a different capture group depending on which side of the alternation matches: Group 2 on the left side, Group 3 on the right.
  • There are probably lots of parentheses you can remove, depending on your needs.
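Putting the points above together, one way to pull out the ID might look like the sketch below. The function name extractVideoId is hypothetical, the sample video ID is arbitrary, and the $ at the end of each branch is my assumption about intent (it matches the $ in the Python original). Building the regex with String.raw and the RegExp constructor is only a convenience so the pattern can be pasted without escaping the forward slashes.

```javascript
// String.raw keeps the backslashes intact, and the RegExp constructor does not
// need "/" escaped, so the pattern is pasted here without further editing.
const VALID_URL = new RegExp(String.raw`^((?:https?://|//)?(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|(?:www\.)?deturl\.com/www\.youtube\.com/|(?:www\.)?pwnyoutube\.com/|(?:www\.)?yourepeat\.com/|tube\.majestyc\.net/|youtube\.googleapis\.com/)(?:[^\n]*?#/)?(?:(?:(?:v|embed|e)/)|(?:(?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?(?:\?|#!?)(?:[^\n]*?&)?v=)))|youtu\.be/|https?://(?:www\.)?cleanvideosearch\.com/media/action/yt/watch\?videoId=)([0-9A-Za-z_-]{11})(?:[^\n]+)?)$|^([0-9A-Za-z_-]{11})$`);

// Returns the 11-character video ID, or null when the input does not match.
// `extractVideoId` is a hypothetical name chosen for this sketch.
function extractVideoId(input) {
  const m = VALID_URL.exec(input);
  if (m === null) return null;
  // Group 2 holds the ID on the URL branch, group 3 on the naked-ID branch.
  return m[2] !== undefined ? m[2] : m[3];
}

console.log(extractVideoId("https://www.youtube.com/watch?v=dQw4w9WgXcQ")); // dQw4w9WgXcQ
console.log(extractVideoId("dQw4w9WgXcQ"));                                 // dQw4w9WgXcQ
console.log(extractVideoId("dQw4w9WgXcQextra"));                            // null
```

Checking for undefined rather than relying on the group number keeps the caller independent of which branch matched.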


Upvotes: 1
