Menelaos Vergis

Reputation: 3955

Extract the video ID from youtube url in .net

I am struggling with a regex to extract the video ID from a youtube url.

"(?:.+?)?(?:\\/v\\/|watch\\/|\\?v=|\\&v=|youtu\\.be\\/|\\/v=|^youtu\\.be\\/)([a-zA-Z0-9_-]{11})+";

It works in that it matches the video ID, but I want to restrict it to the YouTube domains: I don't want it to match the id if the domain is anything other than youtube.com or youtu.be. Unfortunately I don't understand this regex well enough to add the restriction.

I want to match the id only when the domain is youtube.com or youtu.be (optionally prefixed with www.), with http or https at the front (or without).

The above mentioned regex is successfully matching the youtube id of the following examples:

"http://youtu.be/AAAAAAAAA01"
"http://www.youtube.com/embed/watch?feature=player_embedded&v=AAAAAAAAA02"
"http://www.youtube.com/embed/watch?v=AAAAAAAAA03"
"http://www.youtube.com/embed/v=AAAAAAAAA04"
"http://www.youtube.com/watch?feature=player_embedded&v=AAAAAAAAA05"
"http://www.youtube.com/watch?v=AAAAAAAAA06"
"http://www.youtube.com/v/AAAAAAAAA07"
"www.youtu.be/AAAAAAAAA08"
"youtu.be/AAAAAAAAA09"
"http://www.youtube.com/watch?v=i-AAAAAAA14&feature=related"
"http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share&a=9QlmP1yvjcllp0h3l0NwuA"
"http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=AAAAAAAAA16&feature=em-uploademail"
"http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=AAAAAAAAA17"
"http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0"
"http://www.youtube.com/watch/AAAAAAAAA11"

The current code that checks the url right now is:

    private const string YoutubeLinkRegex = "(?:.+?)?(?:\\/v\\/|watch\\/|\\?v=|\\&v=|youtu\\.be\\/|\\/v=|^youtu\\.be\\/)([a-zA-Z0-9_-]{11})+";
    private static Regex regexExtractId = new Regex(YoutubeLinkRegex, RegexOptions.Compiled);


    public string ExtractVideoIdFromUrl(string url)
    {
        //extract the id
        var regRes = regexExtractId.Match(url);
        if (regRes.Success)
        {
            return regRes.Groups[1].Value;
        }
        return null;
    }

Upvotes: 13

Views: 17183

Answers (6)

tym32167

Reputation: 4881

You don't need regular expressions here:

var url = @"https://www.youtube.com/watch?v=6QlW4m9xVZY";
var uri = new Uri(url);

// you can check host here => uri.Host <= "www.youtube.com"

var query = HttpUtility.ParseQueryString(uri.Query);
var videoId = query["v"];

// videoId = 6QlW4m9xVZY

OK, the example above works when the video id is passed as the v query parameter (v=videoId). If the video id is the last path segment instead, you can use this:

var url = "http://youtu.be/AAAAAAAAA09";
var uri = new Uri(url);

var videoid = uri.Segments.Last(); // AAAAAAAAA09

Combining all together, we can get

var url = @"https://www.youtube.com/watch?v=Lvcyj1GfpGY&list=PLolZLFndMkSIYef2O64OLgT-njaPYDXqy";
var uri = new Uri(url);

// you can check host here => uri.Host <= "www.youtube.com"

var query = HttpUtility.ParseQueryString(uri.Query);

var videoId = string.Empty;

if (query.AllKeys.Contains("v"))
{
    videoId = query["v"];
}
else
{
    videoId = uri.Segments.Last();
}

Of course, I don't know anything about your requirements, but, I hope it helps.

Upvotes: 34

Matěj Štágl

Reputation: 1037

This is my $.02 on the previous answers, with added safety checks so that you won't run into "Length cannot be less than zero" errors with some edge-case inputs.

    public static string LinkifyYoutube(this string url)
    {
        if (!url.Contains("data-linkified"))
        {
            return "";
        }

        int pos1 = url.IndexOf("<a target=\'_blank\' data-linkified href=\'", StringComparison.Ordinal);
        int pos2 = url.IndexOf("</a>", StringComparison.Ordinal);

        if (pos1 <= -1 || pos2 - pos1 <= 0)
        {
            return "";
        }

        url = url.Substring(pos1, pos2 - pos1);
        url = url.Replace("<a target=\'_blank\' data-linkified href=\'", "");
        url = url.Replace("\'>", "");
        url = url.Replace("</a>", "");

        var zh = url.LastIndexOf("https", StringComparison.Ordinal);

        if (zh <= 0)
        {
            return "";
        }

        url = url.Substring(0, zh);

        Uri uri = null;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            try
            {
                uri = new UriBuilder("http", url).Uri;
            }
            catch
            {
                return "";
            }
        }

        string host = uri.Host;
        string[] youTubeHosts = { "www.youtube.com", "youtube.com", "youtu.be", "www.youtu.be" };
        if (!youTubeHosts.Contains(host))
        {
            return "";
        }

        var query = HttpUtility.ParseQueryString(uri.Query);

        if (query.AllKeys.Contains("v"))
        {
            return Regex.Match(query["v"], @"^[a-zA-Z0-9_-]{11}$").Value;
        }
        else if (query.AllKeys.Contains("u"))
        {
            return Regex.Match(query["u"], @"/watch\?v=([a-zA-Z0-9_-]{11})").Groups[1].Value;
        }
        else
        {
            var last = uri.Segments.Last().Replace("/", "");
            if (Regex.IsMatch(last, @"^v=[a-zA-Z0-9_-]{11}$"))
            {
                return last.Replace("v=", "");
            }

            string[] segments = uri.Segments;
            if (segments.Length > 2 && segments[segments.Length - 2] != "v/" && segments[segments.Length - 2] != "watch/")
            {
                return "";
            }

            return Regex.Match(last, @"^[a-zA-Z0-9_-]{11}$").Value;
        }
    }

Upvotes: 0

codejockie

Reputation: 10864

This should do it:

public static string GetYouTubeId(string url) {
    var regex = @"(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?|watch)\/|.*[?&]v=)|youtu\.be\/)([^""&?\/ ]{11})";

    var match = Regex.Match(url, regex);

    if (match.Success)
    {
        return match.Groups[1].Value;
    }

    return url;
}

Upvotes: 0

dixhom

Reputation: 3035

tym32167's answer throws an exception at var uri = new Uri(url); when url doesn't have a scheme, like "www.youtu.be/AAAAAAAAA08".

Besides, wrong videoIds are returned for some urls.

So here's my code, based on tym32167's.

    private static string GetYouTubeVideoIdFromUrl(string url)
    {
        Uri uri = null;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            try
            {
                uri = new UriBuilder("http", url).Uri;
            }
            catch
            {
                // invalid url
                return "";
            }
        }

        string host = uri.Host;
        string[] youTubeHosts = { "www.youtube.com", "youtube.com", "youtu.be", "www.youtu.be" };
        if (!youTubeHosts.Contains(host))
            return "";

        var query = HttpUtility.ParseQueryString(uri.Query);

        if (query.AllKeys.Contains("v"))
        {
            return Regex.Match(query["v"], @"^[a-zA-Z0-9_-]{11}$").Value;
        }
        else if (query.AllKeys.Contains("u"))
        {
            // some urls have something like "u=/watch?v=AAAAAAAAA16"
            return Regex.Match(query["u"], @"/watch\?v=([a-zA-Z0-9_-]{11})").Groups[1].Value;
        }
        else
        {
            // remove a trailing forward slash
            var last = uri.Segments.Last().Replace("/", "");
            if (Regex.IsMatch(last, @"^v=[a-zA-Z0-9_-]{11}$"))
                return last.Replace("v=", "");

            string[] segments = uri.Segments;
            if (segments.Length > 2 && segments[segments.Length - 2] != "v/" && segments[segments.Length - 2] != "watch/")
                return "";

            return Regex.Match(last, @"^[a-zA-Z0-9_-]{11}$").Value;
        }
    }

Let's test it.

        string[] urls = {"http://youtu.be/AAAAAAAAA01",
            "http://www.youtube.com/embed/watch?feature=player_embedded&v=AAAAAAAAA02",
            "http://www.youtube.com/embed/watch?v=AAAAAAAAA03",
            "http://www.youtube.com/embed/v=AAAAAAAAA04",
            "http://www.youtube.com/watch?feature=player_embedded&v=AAAAAAAAA05",
            "http://www.youtube.com/watch?v=AAAAAAAAA06",
            "http://www.youtube.com/v/AAAAAAAAA07",
            "www.youtu.be/AAAAAAAAA08",
            "youtu.be/AAAAAAAAA09",
            "http://www.youtube.com/watch?v=i-AAAAAAA14&feature=related",
            "http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share&a=9QlmP1yvjcllp0h3l0NwuA",
            "http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=AAAAAAAAA16&feature=em-uploademail",
            "http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=AAAAAAAAA17",
            "http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0",
            "http://www.youtube.com/watch/AAAAAAAAA11",};

        Console.WriteLine("***Youtube urls***");
        foreach (string url in urls)
        {
            Console.WriteLine("{0}\n-> {1}", url, GetYouTubeVideoIdFromUrl(url));
        }

        string[] invalidUrls = {
            "ww.youtube.com/v/AAAAAAAAA13",
            "http:/www.youtube.com/v/AAAAAAAAA13",
            "http://www.youtub1e.com/v/AAAAAAAAA13",
            "http://www.vimeo.com/v/AAAAAAAAA13",
            "www.youtube.com/b/AAAAAAAAA13",
            "www.youtube.com/v/AAAAAAAAA1",
            "www.youtube.com/v/AAAAAAAAA1&",
            "www.youtube.com/v/AAAAAAAAA1/",
            ".youtube.com/v/AAAAAAAAA13"};

        Console.WriteLine("***Invalid youtube urls***");
        foreach (string url in invalidUrls)
        {
            Console.WriteLine("{0}\n-> {1}", url, GetYouTubeVideoIdFromUrl(url));
        }

Result (everything's alright)

***Youtube urls***
http://youtu.be/AAAAAAAAA01
-> AAAAAAAAA01
http://www.youtube.com/embed/watch?feature=player_embedded&v=AAAAAAAAA02
-> AAAAAAAAA02
http://www.youtube.com/embed/watch?v=AAAAAAAAA03
-> AAAAAAAAA03
http://www.youtube.com/embed/v=AAAAAAAAA04
-> AAAAAAAAA04
http://www.youtube.com/watch?feature=player_embedded&v=AAAAAAAAA05
-> AAAAAAAAA05
http://www.youtube.com/watch?v=AAAAAAAAA06
-> AAAAAAAAA06
http://www.youtube.com/v/AAAAAAAAA07
-> AAAAAAAAA07
www.youtu.be/AAAAAAAAA08
-> AAAAAAAAA08
youtu.be/AAAAAAAAA09
-> AAAAAAAAA09
http://www.youtube.com/watch?v=i-AAAAAAA14&feature=related
-> i-AAAAAAA14
http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share&a=9QlmP1yvjcllp0h3l0NwuA
-> AAAAAAAAA15
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=AAAAAAAAA16&feature=em-uploademail
-> AAAAAAAAA16
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=AAAAAAAAA17
-> AAAAAAAAA17
http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0
-> A-AAAAAAA18
http://www.youtube.com/watch/AAAAAAAAA11
-> AAAAAAAAA11



***Invalid youtube urls***
ww.youtube.com/v/AAAAAAAAA13
-> 
http:/www.youtube.com/v/AAAAAAAAA13
-> 
http://www.youtub1e.com/v/AAAAAAAAA13
-> 
http://www.vimeo.com/v/AAAAAAAAA13
-> 
www.youtube.com/b/AAAAAAAAA13
-> 
www.youtube.com/v/AAAAAAAAA1
-> 
www.youtube.com/v/AAAAAAAAA1&
-> 
www.youtube.com/v/AAAAAAAAA1/
-> 
.youtube.com/v/AAAAAAAAA13
-> 

Upvotes: 4

Menelaos Vergis

Reputation: 3955

The problem is that the regex cannot require a string to appear before the mining action (the part that extracts the id) and at the same time use that string as the mining action itself.

For example, take "http://www.youtu.be/v/AAAAAAAAA07": youtu.be is mandatory at the beginning of the URL, but the mining action is "/v/(11 chars)".

In "http://www.youtu.be/AAAAAAAAA07" the mining action is "youtu.be/(11 chars)".

These two requirements overlap, which is why I did not check the domain and extract the id in the same regex.

I decided to check the domain authority from a list of valid domains and then extract the id from the URL.

 private const string YoutubeLinkRegex = "(?:.+?)?(?:\\/v\\/|watch\\/|\\?v=|\\&v=|youtu\\.be\\/|\\/v=|^youtu\\.be\\/)([a-zA-Z0-9_-]{11})+";
 private static Regex regexExtractId = new Regex(YoutubeLinkRegex, RegexOptions.Compiled);
 private static string[] validAuthorities = { "youtube.com", "www.youtube.com", "youtu.be", "www.youtu.be" };

 public string ExtractVideoIdFromUri(Uri uri)
 {
     try
     {
        string authority = new UriBuilder(uri).Uri.Authority.ToLower();

        //check if the url is a youtube url
        if (validAuthorities.Contains(authority))
        {
            //and extract the id
            var regRes = regexExtractId.Match(uri.ToString());
            if (regRes.Success)
            {
                return regRes.Groups[1].Value;
            }
        }
     }
     catch
     {
         // input that cannot be parsed is treated as "no id found"
     }


     return null;
 }

UriBuilder is preferred because it can understand a wider range of URLs than the Uri class. It can create a Uri from URLs that don't contain a scheme, such as "youtube.com".
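
The scheme-less behaviour can be tried in isolation. The following is only a minimal sketch: the class name SchemeLessUriDemo and the method GetHost are hypothetical names chosen for illustration, and it reads Uri.Host rather than Authority so that a port never becomes part of the comparison.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SchemeLessUriDemo
{
    // Returns the lower-cased host of a URL that may or may not carry a scheme,
    // or null when the text cannot be turned into a Uri at all.
    public static string GetHost(string url)
    {
        try
        {
            // UriBuilder(string) falls back to "http://" when the text
            // does not parse as an absolute URI on its own.
            return new UriBuilder(url).Uri.Host.ToLowerInvariant();
        }
        catch (UriFormatException)
        {
            return null;
        }
    }
}
```

Called with "youtu.be/AAAAAAAAA09" this should yield the host youtu.be, which can then be compared against a list such as validAuthorities.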

The function correctly returns null for the following test URLs:

"ww.youtube.com/v/AAAAAAAAA13"
"http:/www.youtube.com/v/AAAAAAAAA13"
"http://www.youtub1e.com/v/AAAAAAAAA13"
"http://www.vimeo.com/v/AAAAAAAAA13"
"www.youtube.com/b/AAAAAAAAA13"
"www.youtube.com/v/AAAAAAAAA1"
"www.youtube.com/v/AAAAAAAAA1&"
"www.youtube.com/v/AAAAAAAAA1/"
".youtube.com/v/AAAAAAAAA13"
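
For comparison, one possible alternative (not the approach used above) is a zero-width lookahead, which checks the host at the start of the string without consuming it, so "youtu.be/" remains available as an extraction marker. This is a sketch under assumptions: the class name LookaheadYoutubeId and the method TryExtract are hypothetical, the host list is limited to youtube.com and youtu.be with an optional www., and matching is case-sensitive.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LookaheadYoutubeId
{
    // (?= ... ) asserts that the string starts with an allowed host but
    // consumes nothing, so the marker alternation still sees the whole URL.
    private static readonly Regex Pattern = new Regex(
        @"^(?=(?:https?://)?(?:www\.)?(?:youtube\.com|youtu\.be)/)" +
        @".*?(?:/v/|watch/|\?v=|&v=|youtu\.be/|/v=)([a-zA-Z0-9_-]{11})",
        RegexOptions.Compiled);

    // Returns the 11-character id, or null when the URL does not match.
    public static string TryExtract(string url)
    {
        Match m = Pattern.Match(url);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Like the original pattern, this only looks at the first 11 id characters after a marker and does not verify what follows them; the authority check above is easier to extend when more hosts need to be allowed.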

Upvotes: 11

confusedandamused

Reputation: 756

As said by septih here

I had a play around with the examples and came up with these:

Youtube: youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9-_]+)

And they should match all those given. The (?: ...) means that everything inside the brackets won't be captured, so only the id should be obtained.

Upvotes: 1
