geek1983
geek1983

Reputation: 154

How do I use Regex to find the ID in a YouTube link?

when I try to extract this video ID (AIiMa2Fe-ZQ) with a regex expression, I can't get the dash an all the letters after.

>>> id = re.search('(?<=\?v\=)\w+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
>>> print id.group(0)
>>> AIiMa2Fe

Upvotes: 2

Views: 437

Answers (6)

dawg
dawg

Reputation: 104092

/(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)/

Explain the RE

There are three alternate YouTube formats: /v/[ID] and watch?v= and the new AJAX watch#!v= This RE captures all three. There is also new YouTube URL for user pages that is of the form /user/[user]?content={complex URI} This is not captured here by any regex...

Upvotes: 1

hughdbrown
hughdbrown

Reputation: 49043

I'd try this:

>>> import re
>>> a = re.compile(r'.*(\-\w+)$')
>>> a.search('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group(1)
'-ZQ'

Upvotes: 0

manifest
manifest

Reputation: 2268

I don't know the pattern for youtube hashes, but just include the "-" in the possibilities as it is not considered an alpha:

import re
id = re.search('(?<=\?v\=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
print id.group(0)

I have edited the above because as it turns out:

>>> re.search("[\w|-]", "|").group(0)
'|'

The "|" in the character definition does not act as a special character but does indeed match the "|" pipe. My apologies.

Upvotes: 1

Ivo Wetzel
Ivo Wetzel

Reputation: 46756

Use the urlparse module instead of regex for such kind of things.

import urlparse

parsed_url = urlparse.urlparse(url)
if parsed_url.netloc.find('youtube.com') != -1 and parsed_url.path == '/watch':
    video = urlparse.parse_qs(parsed_url.query).get('v', None)

    if video is None:
        video = urlparse.parse_qs(parsed_url.fragment.strip('!')).get('v', None)

    if video is not None:
        print video[0]

EDIT: Updated for the upcoming new youtube url format.

Upvotes: 1

SilentGhost
SilentGhost

Reputation: 319939

>>> re.search('(?<=v=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group()
'AIiMa2Fe-ZQ'

\w is a short-hand for [a-zA-Z0-9_] in python2.x, you'll have to use re.A flag in py3k. You quite clearly have additional character in that videoid, i.e., hyphen. I've also removed redundant escape backslashes from the lookbehind.

Upvotes: 1

Taylor Leese
Taylor Leese

Reputation: 52390

Intead of \w+ use below. Word character (\w) doesn't include a dash. It only includes [a-zA-Z_0-9].

[\w-]+

Upvotes: 2

Related Questions