Pav Sidhu

Reputation: 6944

Get YouTube video ID from URL with Python and Regex

I would like to use a regex to extract the video ID from a YouTube URL that appears inside an HTML anchor element, like so:

<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>

I have looked around for solutions and found a JavaScript one that extracts the video ID from the URL like so:

/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig

I would like to use this in Python because it supports every variant of YouTube's URLs. I implemented it in my Python script:

string = re.sub(r'https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:[\'"][^<>]*>|<\/a>))[?=&+%\w.-]*', r'\1', string)

And I get no replacements. I removed the leading / and the trailing /ig from the regex, as they are JavaScript-only syntax, but I still can't get it to pick up the video ID. Once I can pick up the ID, I can easily adapt the regex to remove the anchor element.

What have I done wrong with my solution? Thanks.

Upvotes: 2

Views: 3354

Answers (2)

Mike Covington

Reputation: 2157

I don't think this (scroll right to see part denoted by ^^) is supposed to be a negative lookahead:

https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*
                                                                                                         ^^

I believe it should be a non-capturing group (i.e., ?! should be ?:).

>>> import re

>>> html = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'
>>> pattern = re.compile(r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""", re.IGNORECASE)
>>> re.search(pattern, html).groups()
('NC2blnl0WTE',)

EDIT: Notice that I also had to use re.IGNORECASE. This is because the regex, as-is, won't match the www in www.youtube.com. You would need [0-9A-Z-] to be [0-9A-Za-z-]. However, it is safer to ignore case altogether so you don't have to worry about other text in the URL.
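To see the case-sensitivity issue in isolation, here is a minimal sketch that tests only the scheme and subdomain part of the pattern. The shortened `prefix` pattern is an assumption made for illustration; it is not the full regex from the question.

```python
import re

# Only the start of the pattern (scheme, optional subdomain, domain),
# cut down for illustration.
prefix = r"https?://(?:[0-9A-Z-]+\.)?youtube\.com"
url = "http://www.youtube.com/watch?v=NC2blnl0WTE"

# Without a flag, [0-9A-Z-] cannot consume the lowercase "www",
# so the optional group is skipped and "youtube" fails to match at "www".
case_sensitive = re.match(prefix, url)

# With re.IGNORECASE the same class also accepts lowercase letters.
case_insensitive = re.match(prefix, url, re.IGNORECASE)

print(case_sensitive)             # None
print(case_insensitive.group(0))  # http://www.youtube.com
```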

EDIT2: As a negative lookahead, it can never match when the URL is followed by the rest of your anchor element, i.e. the end of the opening tag and the closing tag (">blah blah blah</a>).
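The effect of the negative lookahead can be checked with a short sketch. It uses the original pattern from the question unchanged, assumes re.IGNORECASE is applied, and compares a bare URL with the same URL inside the anchor element. The variable names are only for illustration.

```python
import re

# Original pattern from the question, with the negative lookahead (?!...) kept.
original = re.compile(
    r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""",
    re.IGNORECASE,
)

bare_url = 'http://www.youtube.com/watch?v=NC2blnl0WTE'
in_anchor = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'

# Nothing follows the ID, so the lookahead has nothing to reject.
bare_match = original.search(bare_url)

# Here the ID is followed by '">', which the lookahead rejects.
anchor_match = original.search(in_anchor)

print(bare_match.group(1))  # NC2blnl0WTE
print(anchor_match)         # None
```

In other words, the pattern as written appears to skip URLs that already sit inside an href attribute, which is why the substitution in the question changes nothing.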

Upvotes: 2

kwarunek

Reputation: 12577

I use something like the code below, based on "Youtube I.D parsing for new URL formats" and "Python regex convert youtube url to youtube video".

import re

test_links = """
    'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
    'http://www.youtube.com/watch?/watch?other_param&v=5Y6HSHwhVlY',
    'http://www.youtube.com/v/5Y6HSHwhVlY',
    'http://youtu.be/5Y6HSHwhVlY', 
    'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
    'http://m.youtube.com/v/5Y6HSHwhVlY',
    'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&amp;hl=en_US',
    'http://www.youtube.com/',
    'http://www.youtube.com/?feature=ytca'
"""

pattern = r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})'

result = re.findall(pattern, test_links, re.MULTILINE | re.IGNORECASE)

print(result)

But I really don't know if this is up to date.

edit

allow all subdomains
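For use on a single URL rather than a block of text, the same pattern could be wrapped in a small helper. This is only a sketch: the names `YOUTUBE_ID` and `extract_video_id` are made up here, and the pattern is copied from the code above without further checking against current YouTube URL formats.

```python
import re

# Pattern copied from the answer above, compiled once for reuse.
YOUTUBE_ID = re.compile(
    r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/'
    r'(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})',
    re.IGNORECASE,
)


def extract_video_id(text):
    """Return the first 11-character ID found in text, or None."""
    match = YOUTUBE_ID.search(text)
    return match.group(1) if match else None


print(extract_video_id('http://youtu.be/5Y6HSHwhVlY'))
print(extract_video_id('<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'))
print(extract_video_id('http://www.youtube.com/'))
```

One caveat: the class [^&=\n%\?] also accepts "/", so a longer path such as the hypothetical http://www.youtube.com/feed/subscriptions would be reported as containing an ID. A stricter class such as [\w-] might be worth trying if that matters.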

Upvotes: 4
