user1295450

Reputation: 167

Extracting URL from string

Assume my string is

http://www.test.com\r\nhttp://www.hello.com<some text here>http://www.world.com

I want to extract all URLs in the string. The output should be as follows:

http://www.test.com
http://www.hello.com
http://www.world.com

How can I achieve that?

There are no HTML tags in the string, so extracting the URLs with HTMLAgilityPack is not a viable option.

Upvotes: 0

Views: 3484

Answers (3)

Diego D

Reputation: 8152

Of the approaches suggested in the other answers and comments, the easiest one I can actually implement is splitting. There is a lot of guesswork involved here, but this is probably the best bet for catching all the URLs:

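using System.Collections.Generic;
using System.Text.RegularExpressions;

public static List<string> ParseUrls(string input) {
    List<string> urls = new List<string>();
    const string pattern = "http://"; // here you may use a better expression to include ftp and so on
    // Note: if the pattern contains capture groups, Regex.Split also returns the captured text,
    // so the indexing below would have to change.
    string[] m = Regex.Split(input, pattern);
    // m[0] is whatever precedes the first "http://"; every later element starts right after a prefix
    for (int i = 1; i < m.Length; i++) {
        // keep only the leading run of characters accepted as part of the URL
        Match urlMatch = Regex.Match(m[i], "^(?<url>[a-zA-Z0-9/?=&.]+)", RegexOptions.Singleline);
        if (urlMatch.Success)
            urls.Add(string.Format("http://{0}", urlMatch.Groups["url"].Value)); // modify the prefix according to the chosen pattern
    }
    return urls;
}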
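
As a variation that is not part of the answer itself, the same character class can be used with Regex.Matches so that no splitting is needed. This is only a sketch: the class name UrlMatcher, the method name MatchUrls, and the extra schemes (https, ftp) are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlMatcher
{
    // Hypothetical helper: collects every match of "scheme://" followed by
    // the characters accepted by the character class used above.
    public static List<string> MatchUrls(string input)
    {
        var urls = new List<string>();
        // The scheme alternatives are an assumption; adjust them to your data.
        foreach (Match m in Regex.Matches(input, @"(?:https?|ftp)://[a-zA-Z0-9/?=&.]+"))
            urls.Add(m.Value);
        return urls;
    }
}
```

This handles the sample string from the question. Unlike the split-based code, it relies on some other character separating two URLs: if one URL is followed immediately by the next, the first match runs on into the next URL's scheme.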

Upvotes: 4

Kjartan

Reputation: 19081

You could use the string-splitting logic from this question, splitting on "http://". If you need the "http://" part, you can always add it back afterwards.

Edit: Note that you would then have to find and strip things like \r\n at the end of each URL, but that should not be a big problem.
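
A minimal sketch of this idea, using plain String.Split rather than a regex. The names UrlSplitter and SplitUrls are hypothetical, and the set of characters treated as the end of a URL ('\r', '\n', '<', space) is an assumption based on the sample string in the question.

```csharp
using System;
using System.Collections.Generic;

public static class UrlSplitter
{
    // Hypothetical helper: splits on the literal "http://" and re-adds the prefix.
    public static List<string> SplitUrls(string input)
    {
        const string prefix = "http://";
        var urls = new List<string>();
        string[] parts = input.Split(new[] { prefix }, StringSplitOptions.None);
        // parts[0] is whatever came before the first prefix, so start at 1.
        for (int i = 1; i < parts.Length; i++)
        {
            // Assumption: any of these characters marks the end of the URL.
            int end = parts[i].IndexOfAny(new[] { '\r', '\n', '<', ' ' });
            string body = end >= 0 ? parts[i].Substring(0, end) : parts[i];
            if (body.Length > 0)
                urls.Add(prefix + body);
        }
        return urls;
    }
}
```

For the string in the question this yields the three expected URLs. Other inputs may need more terminator characters.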

Upvotes: 0

Vaughan Hilts

Reputation: 2879

Since ":" rarely appears in a URL apart from the scheme separator and an optional port, you can assume that searching for "http://" gives you a valid start of a URL.

Search for this and find your start.

You could construct a list of known good TLDs you may encounter (this will help: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains)

A known TLD marks the end of the URL, so search for one starting from the index you found above.

Take the text from the start index up to the end of the TLD and discard everything after it.

I'm assuming the URLs have no paths (sub-directories), since you haven't listed any.
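
The steps above could be sketched roughly as follows. The names TldUrlScanner and FindUrls are hypothetical, and the three-entry TLD list is only a placeholder for a real list such as the one linked above.

```csharp
using System;
using System.Collections.Generic;

public static class TldUrlScanner
{
    // Assumption: a tiny illustrative TLD list; a real one would be far longer.
    private static readonly string[] KnownTlds = { ".com", ".org", ".net" };

    public static List<string> FindUrls(string input)
    {
        const string prefix = "http://";
        var urls = new List<string>();
        int start = input.IndexOf(prefix, StringComparison.Ordinal);
        while (start >= 0)
        {
            int from = start + prefix.Length;
            // Do not search past the start of the next URL.
            int next = input.IndexOf(prefix, from, StringComparison.Ordinal);
            int limit = next >= 0 ? next : input.Length;
            // Pick the known TLD that ends earliest after this prefix.
            int end = -1;
            foreach (string tld in KnownTlds)
            {
                int idx = input.IndexOf(tld, from, limit - from, StringComparison.Ordinal);
                if (idx >= 0 && (end < 0 || idx + tld.Length < end))
                    end = idx + tld.Length;
            }
            if (end >= 0)
                urls.Add(input.Substring(start, end - start)); // drop everything after the TLD
            start = next;
        }
        return urls;
    }
}
```

This is a plain substring search, so a host such as www.company.org would be cut short at ".com". A real implementation would need to check what follows the TLD.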

Upvotes: 0
