Reputation: 154
when I try to extract this video ID (AIiMa2Fe-ZQ) with a regex expression, I can't get the dash an all the letters after.
>>> id = re.search('(?<=\?v\=)\w+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
>>> print id.group(0)
>>> AIiMa2Fe
Upvotes: 2
Views: 437
Reputation: 104092
/(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)/
There are three alternate YouTube formats: /v/[ID]
and watch?v=
and the new AJAX watch#!v=
This RE captures all three. There is also new YouTube URL for user pages that is of the form /user/[user]?content={complex URI} This is not captured here by any regex...
Upvotes: 1
Reputation: 49043
I'd try this:
>>> import re
>>> a = re.compile(r'.*(\-\w+)$')
>>> a.search('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group(1)
'-ZQ'
Upvotes: 0
Reputation: 2268
I don't know the pattern for youtube hashes, but just include the "-" in the possibilities as it is not considered an alpha:
import re
id = re.search('(?<=\?v\=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
print id.group(0)
I have edited the above because as it turns out:
>>> re.search("[\w|-]", "|").group(0)
'|'
The "|" in the character definition does not act as a special character but does indeed match the "|" pipe. My apologies.
Upvotes: 1
Reputation: 46756
Use the urlparse module instead of regex for such kind of things.
import urlparse
parsed_url = urlparse.urlparse(url)
if parsed_url.netloc.find('youtube.com') != -1 and parsed_url.path == '/watch':
video = urlparse.parse_qs(parsed_url.query).get('v', None)
if video is None:
video = urlparse.parse_qs(parsed_url.fragment.strip('!')).get('v', None)
if video is not None:
print video[0]
EDIT: Updated for the upcoming new youtube url format.
Upvotes: 1
Reputation: 319939
>>> re.search('(?<=v=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group()
'AIiMa2Fe-ZQ'
\w
is a short-hand for [a-zA-Z0-9_]
in python2.x, you'll have to use re.A
flag in py3k. You quite clearly have additional character in that videoid, i.e., hyphen. I've also removed redundant escape backslashes from the lookbehind.
Upvotes: 1
Reputation: 52390
Intead of \w+ use below. Word character (\w) doesn't include a dash. It only includes [a-zA-Z_0-9].
[\w-]+
Upvotes: 2