Get YouTube video ID from URL with Python and Regex

Question

I would like to use a regex to retrieve the video ID from a YouTube URL that appears inside an HTML anchor element, like so:

<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>

I have looked around for solutions and found a JavaScript one that takes the video ID from the URL like so:

/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig

I would like to use this in Python because it supports every variant of YouTube's URLs. I implemented it in my Python script:

string = re.sub(r'https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:[\'"][^<>]*>|<\/a>))[?=&+%\w.-]*', r'\1', string)

This makes no replacements. I removed the leading / and the trailing /ig from the regex, since those are JavaScript-only syntax, but it still does not pick up the video ID. Once I can capture the ID, I can easily adjust the regex to remove the anchor element.

What have I done wrong with my solution? Thanks.

Mike Covington · Accepted Answer

I don't think this (scroll right to see part denoted by ^^) is supposed to be a negative lookahead:

https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*
                                                                                                         ^^

I believe it should be a non-capturing group (i.e., ?! should be ?:).

>>> import re

>>> html = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'
>>> pattern = re.compile(r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""", re.IGNORECASE)
>>> re.search(pattern, html).groups()
('NC2blnl0WTE',)

EDIT: Notice that I also had to use re.IGNORECASE. Without it, the regex won't match the www in www.youtube.com, because [0-9A-Z-] only covers uppercase letters (the JavaScript original relied on the i flag, which was dropped along with /ig). You could change [0-9A-Z-] to [0-9A-Za-z-], but ignoring case is safer because you don't have to worry about other text in the URL.
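To see the case issue in isolation, here is a minimal sketch that uses only the host part of the pattern. The names HOST and url are made up for illustration, and the URL is an assumed example built around the video ID from the output above.

```python
import re

# Host portion of the pattern only (illustrative; not the full regex).
HOST = r"https?://(?:[0-9A-Z-]+\.)?youtube\.com"

# Assumed example URL; the ID is the one shown in the answer's output.
url = "http://www.youtube.com/watch?v=NC2blnl0WTE"

# Case-sensitive: the optional subdomain group cannot consume lowercase "www",
# so "youtube" is tried directly after "http://" and the match fails.
case_sensitive = re.match(HOST, url)

# Case-insensitive: [0-9A-Z-] now also accepts lowercase letters.
case_insensitive = re.match(HOST, url, re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # http://www.youtube.com
```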

EDIT2: Because it is a negative lookahead, the pattern can never match when the URL is followed by the closing quote and bracket of your anchor's opening tag (">blah blah blah), which is exactly the situation in your HTML.
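The effect of the negative lookahead can be checked with a small sketch like the one below. It rebuilds the question's pattern in pieces (PREFIX, NEG_LOOKAHEAD and TAIL are illustrative names, and the forward slashes are left unescaped because Python does not require escaping them). Both input strings are assumed examples: the same URL once as bare text and once inside an anchor.

```python
import re

# The question's pattern, split into parts for readability.
PREFIX = (r"https?://(?:[0-9A-Z-]+\.)?"
          r"(?:youtu\.be/|youtube(?:-nocookie)?\.com\S*[^\w\s-])"
          r"([\w-]{11})(?=[^\w-]|$)")
NEG_LOOKAHEAD = r"""(?![?=&+%\w.-]*(?:['"][^<>]*>|</a>))"""
TAIL = r"[?=&+%\w.-]*"

pattern = re.compile(PREFIX + NEG_LOOKAHEAD + TAIL, re.IGNORECASE)

# Assumed inputs: a bare URL in text, and the same URL inside an anchor tag.
bare = "Watch http://www.youtube.com/watch?v=NC2blnl0WTE today"
linked = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'

bare_match = pattern.search(bare)
linked_match = pattern.search(linked)

print(bare_match.group(1))  # NC2blnl0WTE
print(linked_match)         # None: the "> after the URL trips the lookahead
```

So the negative-lookahead version deliberately skips URLs that are already inside a link, which is why it makes no replacements on anchor-wrapped input.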
