Reputation: 6944
I would like to use a regex to retrieve the video ID from a YouTube URL that is part of an HTML anchor element, like so:
<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>
I have looked around for solutions. I found a JavaScript one that extracts the video ID from the URL like so:
/https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*/ig
I would like to use this in Python, as it supports every variant of YouTube's URLs. I implemented it in my Python script:
string = re.sub(r'https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:[\'"][^<>]*>|<\/a>))[?=&+%\w.-]*', r'\1', string)
And I get no replacements. I removed the leading / and the trailing /ig from the regex, as they are only used in JavaScript; however, I still can't get it to pick up the video ID. Once I am able to pick up the ID, I can easily adjust the regex to remove the anchor element.
What have I done wrong with my solution? Thanks.
Upvotes: 2
Views: 3354
Reputation: 2157
I don't think the (?! group near the end of the pattern, right after (?=[^\w-]|$), is supposed to be a negative lookahead:
https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?![?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*
I believe it should be a non-capturing group (i.e., ?! should be ?:).
>>> import re
>>> html = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'
>>> pattern = re.compile(r"""https?:\/\/(?:[0-9A-Z-]+\.)?(?:youtu\.be\/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)(?:[?=&+%\w.-]*(?:['"][^<>]*>|<\/a>))[?=&+%\w.-]*""", re.IGNORECASE)
>>> re.search(pattern, html).groups()
('NC2blnl0WTE',)
EDIT: Notice that I also had to use re.IGNORECASE. This is because the regex, as-is, won't match the www in www.youtube.com. You would need [0-9A-Z-] to be [0-9A-Za-z-]. However, it is safer to just ignore case so you don't have to worry about other text in the URL.
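To illustrate the case issue in isolation, here is a minimal sketch that tests only the scheme and optional-subdomain part of the pattern. The shortened pattern and the variable names are mine, chosen for illustration; they are not part of the original regex.

```python
import re

url = "http://www.youtube.com/watch?v=NC2blnl0WTE"

# Only the scheme and the optional subdomain group, followed by the host.
# This is a trimmed-down fragment for illustration, not the full pattern.
subdomain = r"https?://(?:[0-9A-Z-]+\.)?youtube\.com"

# Without the flag, [0-9A-Z-] cannot match the lowercase "www",
# so the optional group is skipped and "youtube" fails against "www".
print(re.match(subdomain, url))

# With re.IGNORECASE the same fragment matches.
print(bool(re.match(subdomain, url, re.IGNORECASE)))
```

The first call prints None and the second prints True, which is why the full pattern needs the flag (or a lowercase range) before anything else can match.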
EDIT2: As a negative lookahead, it means you can never get a match when the URL is immediately followed by the closing quote and bracket of the opening anchor tag, as in ">blah blah blah</a>.
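As a rough side-by-side check of the two variants, the sketch below builds the pattern once with (?! and once with (?:, and runs both against the anchor from the question. Splitting the pattern into prefix and tail is my own arrangement for readability, and the redundant \/ escapes are written as plain /.

```python
import re

html = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'

# Everything up to and including the captured 11-character ID.
prefix = r"""https?://(?:[0-9A-Z-]+\.)?(?:youtu\.be/|youtube(?:-nocookie)?\.com\S*[^\w\s-])([\w-]{11})(?=[^\w-]|$)"""
# The part that is wrapped either in (?! ... ) or in (?: ... ).
tail = r"""[?=&+%\w.-]*(?:['"][^<>]*>|</a>)"""
rest = r"[?=&+%\w.-]*"

negative = re.compile(prefix + "(?!" + tail + ")" + rest, re.IGNORECASE)
grouped = re.compile(prefix + "(?:" + tail + ")" + rest, re.IGNORECASE)

# The URL is followed by "> (end of the href attribute), so the
# negative lookahead rejects the only candidate match.
print(negative.search(html))

# As a plain non-capturing group, the same text is consumed instead.
print(grouped.search(html).group(1))
```

With the lookahead the search returns None; with the non-capturing group it yields NC2blnl0WTE, matching the session shown earlier in this answer.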
Upvotes: 2
Reputation: 12577
I use something like the code below, based on "Youtube I.D parsing for new URL formats" and "Python regex convert youtube url to youtube video".
import re
test_links = """
'http://www.youtube.com/watch?v=5Y6HSHwhVlY',
'http://www.youtube.com/watch?/watch?other_param&v=5Y6HSHwhVlY',
'http://www.youtube.com/v/5Y6HSHwhVlY',
'http://youtu.be/5Y6HSHwhVlY',
'http://www.youtube.com/embed/5Y6HSHwhVlY?rel=0" frameborder="0"',
'http://m.youtube.com/v/5Y6HSHwhVlY',
'https://www.youtube-nocookie.com/v/5Y6HSHwhVlY?version=3&hl=en_US',
'http://www.youtube.com/',
'http://www.youtube.com/?feature=ytca'
"""
pattern = r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})'
result = re.findall(pattern, test_links, re.MULTILINE | re.IGNORECASE)
print(result)
But I really don't know whether this is up to date.
Edit: allow all subdomains.
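For the anchor element from the question, the same pattern could be wrapped in a small helper. This is only a sketch: the names YOUTUBE_ID and extract_video_ids are hypothetical, and re.MULTILINE is left out because the pattern contains no ^ or $ anchors.

```python
import re

# Pattern copied from the snippet above, compiled once for reuse.
YOUTUBE_ID = re.compile(
    r'(?:https?:\/\/)?(?:[0-9A-Z-]+\.)?(?:youtube|youtu|youtube-nocookie)\.(?:com|be)\/'
    r'(?:watch\?v=|watch\?.+&v=|embed\/|v\/|.+\?v=)?([^&=\n%\?]{11})',
    re.IGNORECASE,
)


def extract_video_ids(text):
    """Return every 11-character ID the pattern finds in text."""
    return YOUTUBE_ID.findall(text)


html = '<a href="http://www.youtube.com/watch?v=NC2blnl0WTE">Some text</a>'
print(extract_video_ids(html))  # ['NC2blnl0WTE']
```

One caveat: the capture group accepts any 11 characters other than &, =, %, ? and newline, so a non-video path with at least 11 such characters after the slash (for example /feed/subscriptions) would also produce a result. Validate the ID further if that matters for your input.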
Upvotes: 4