John Yuki
John Yuki

Reputation: 310

Get YouTube video url or YouTube video ID from a string using RegEx

So I've been stuck on this for about an hour or so now and I just cannot get it to work. So far I have been trying to extract the whole link from the string, but now I feel like it might be easier to just get the video ID.

The RegEx would need to take the ID/URL from the following link styles, no matter where they are in a string:

http://youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
https://youtube.com/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
youtube.com/iwGFalTRHDA
youtube.com/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc

So far, I have this RegEx:

((http(s)?\:\/\/)?(www\.)?(youtube|youtu)((\.com|\.be)\/)(watch\?v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))

This catches the link, however it also breaks down the link in to multiple parts and also adds that to the list too, so if a string contains a single youtube link, the output when I print the list is something like this:

('https://www.youtube.com/watch?v=Idn7ODPMhFY', 'https://', 's', 'www.', 'youtube', '.com/', '.com', 'watch?v=', 'Idn7ODPMhFY', '', '')

I need the list to only contain the link itself, or just the video id (which would be more preferable). I have really tried doing this myself for quite a while now but I just cannot figure it out. I was wondering if someone could sort out the regex for me and tell me where I am going wrong so that I don't run in to this issue again in the future?

Upvotes: 1

Views: 3144

Answers (3)

willeM_ Van Onsem
willeM_ Van Onsem

Reputation: 476574

Instead of writing a complicated regex that probably work not in all cases, you better use tools to analyze the url, like urllib:

from urllib.parse import urlparse, parse_qs

url = 'http://youtube.com/watch?v=iwGFalTRHDA'

def get_id(url):
    u_pars = urlparse(url)
    quer_v = parse_qs(u_pars.query).get('v')
    if quer_v:
        return quer_v[0]
    pth = u_pars.path.split('/')
    if pth:
        return pth[-1]

This function will return None if both attempts fail.

I tested it with the sample urls:

>>> get_id('http://youtube.com/watch?v=iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related')
'iwGFalTRHDA'
>>> get_id('https://youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('http://youtu.be/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('youtube.com/iwGFalTRHDA')
'iwGFalTRHDA'
>>> get_id('youtube.com/n17B_uFF4cA')
'n17B_uFF4cA'
>>> get_id('http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4')
'r5nB9u4jjy4'
>>> get_id('http://www.youtube.com/watch?v=t-ZRX8984sc')
't-ZRX8984sc'
>>> get_id('http://youtu.be/t-ZRX8984sc')
't-ZRX8984sc'

Upvotes: 7

Lukas Graf
Lukas Graf

Reputation: 32580

Here's the approach I'd use, no regex needed at all.

(This is pretty much equivalent to @Willem Van Onsem's solution, plus an easy to run / update unit test).

from urlparse import parse_qs
from urlparse import urlparse
import re
import unittest


TEST_URLS = [
    ('iwGFalTRHDA', 'http://youtube.com/watch?v=iwGFalTRHDA'),
    ('iwGFalTRHDA', 'http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related'),
    ('iwGFalTRHDA', 'https://youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'http://youtu.be/n17B_uFF4cA'),
    ('iwGFalTRHDA', 'youtube.com/iwGFalTRHDA'),
    ('n17B_uFF4cA', 'youtube.com/n17B_uFF4cA'),
    ('r5nB9u4jjy4', 'http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4'),
    ('t-ZRX8984sc', 'http://www.youtube.com/watch?v=t-ZRX8984sc'),
    ('t-ZRX8984sc', 'http://youtu.be/t-ZRX8984sc'),
    (None, 'http://www.stackoverflow.com')
]

YOUTUBE_DOMAINS = [
    'youtu.be',
    'youtube.com',
]


def extract_id(url_string):
    # Make sure all URLs start with a valid scheme
    if not url_string.lower().startswith('http'):
        url_string = 'http://%s' % url_string

    url = urlparse(url_string)

    # Check host against whitelist of domains
    if url.hostname.replace('www.', '') not in YOUTUBE_DOMAINS:
        return None

    # Video ID is usually to be found in 'v' query string
    qs = parse_qs(url.query)
    if 'v' in qs:
        return qs['v'][0]

    # Otherwise fall back to path component
    return url.path.lstrip('/')


class TestExtractID(unittest.TestCase):

    def test_extract_id(self):
        for expected_id, url in TEST_URLS:
            result = extract_id(url)
            self.assertEqual(
                expected_id, result, 'Failed to extract ID from '
                'URL %r (got %r, expected %r)' % (url, result, expected_id))


if __name__ == '__main__':
    unittest.main()

Upvotes: 3

Dekel
Dekel

Reputation: 62536

I really advise on @LukasGraf's comment, however if you really must use regex you can check the following:

(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(\-|\_)[0-z]{4}|.(\-|\_)[0-z]{9}))

Here is a working example in regex101: https://regex101.com/r/5eRqn2/1

And here the python example:

In [38]: r = re.compile('(?:(?:https?\:\/\/)?(?:www\.)?(?:youtube|youtu)(?:(?:\.com|\.be)\/)(?:embed\/)?(?:watch\?)?(?:feature=player_embedded)?&?(?:v=)?([0-z]{11}|[0-z]{4}(?:\-|\_)[0-z]{4}|.(?:\-|\_)[0-z]{9}))')
In [39]: r.match('http://youtube.com/watch?v=iwGFalTRHDA').groups()
Out[39]: ('iwGFalTRHDA',)
In [40]: r.match('http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related').groups()
Out[40]: ('iwGFalTRHDA',)
In [41]: r.match('https://youtube.com/iwGFalTRHDA').groups()
Out[41]: ('iwGFalTRHDA',)

In order to not catch specific group in regex you should this: (?:...)

Upvotes: 2

Related Questions