TheGeekZn

Reputation: 3914

c# Parse URL in text

I have a sentence that may contain URLs. I need to take any uppercase URL that starts with WWW. and prepend HTTP:// to it. I have tried the following:

    private string ParseUrlInText(string text)
    {
        string currentText = text;

        foreach (string word in currentText.Split(new[] { "\r\n", "\n", " ", "</br>" }, StringSplitOptions.RemoveEmptyEntries))
        {
            string thing;
            if (word.ToLower().StartsWith("www."))
            {
                if (IsAllUpper(word))
                {
                    thing = "HTTP://" + word;

                    currentText = ReplaceFirst(currentText, word, thing);
                }
            }
        }

        return currentText;
    }

    public string ReplaceFirst(string text, string search, string replace)
    {
        int pos = text.IndexOf(search);
        if (pos < 0)
        {
            return text;
        }
        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }

    private static bool IsAllUpper(string input)
    {
        return input.All(t => !Char.IsLetter(t) || Char.IsUpper(t));
    }

However, it only prepends HTTP:// to the first URL, several times over. I am trying to convert the following:

WWW.GOOGLE.CO.ZA
WWW.GOOGLE.CO.ZA WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA
there are a lot of domains (This shouldn't be parsed)

to

HTTP:// WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA HTTP:// WWW.GOOGLE.CO.ZA
HTTP:// WWW.GOOGLE.CO.ZA
there are a lot of domains (This shouldn't be parsed)

Could someone please show me the proper way to do this?

Edit: I need to keep the formatting of the string (spaces, newlines, etc.).
Edit2: A URL might already have HTTP:// prepended. I've updated the demo.

Upvotes: 0

Views: 544

Answers (2)

Kilazur

Reputation: 3188

The issue with your code is that you're using a ReplaceFirst method, which does exactly what it's meant to do: it replaces the first occurrence, which is not always the one you want to replace. Every loop iteration hits the first WWW.GOOGLE.CO.ZA again, which is why it alone collects all the HTTP:// prefixes.
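To make the failure concrete, here is a minimal sketch that calls ReplaceFirst twice on the same text, as the loop in the question does for a repeated word. ReplaceFirst is copied from the question; `WWW.A.COM` and the class name `ReplaceFirstDemo` are made-up values for illustration only.

```csharp
using System;

public static class ReplaceFirstDemo
{
    // Same logic as ReplaceFirst in the question: only the first match is touched.
    public static string ReplaceFirst(string text, string search, string replace)
    {
        int pos = text.IndexOf(search);
        if (pos < 0)
        {
            return text;
        }
        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }

    public static void Main()
    {
        // Hypothetical sample text containing the same URL twice.
        string text = "WWW.A.COM WWW.A.COM";

        // First loop iteration: the first occurrence gets the prefix.
        text = ReplaceFirst(text, "WWW.A.COM", "HTTP://WWW.A.COM");
        Console.WriteLine(text); // HTTP://WWW.A.COM WWW.A.COM

        // Second iteration: IndexOf finds "WWW.A.COM" inside the URL that was
        // already prefixed, so the prefix is stacked there and the second URL
        // is never reached.
        text = ReplaceFirst(text, "WWW.A.COM", "HTTP://WWW.A.COM");
        Console.WriteLine(text); // HTTP://HTTP://WWW.A.COM WWW.A.COM
    }
}
```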

One approach would be to use a StreamReader or something similar: each time you reach a new word, check whether its first four characters are "WWW." and, if so, insert the string "HTTP://" at that position. But that is pretty long-winded for something that can be much shorter...

So let's go Regex!

How to insert characters before a match with Regex (note the parentheses: $1 refers to the first capture group):

Regex.Replace(input, @"([abc])", "adding_text_before_match$1");

How to match words not starting with another word:

(?<!wont_start_with_that)word_to_match
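As a quick illustration of these two building blocks, the sketch below runs each one on a tiny made-up input. The class name `RegexBasicsDemo`, the method names, and the sample strings are assumptions for the example. Note that `$1` only works when the pattern contains a capture group, which is why `[abc]` is wrapped in parentheses here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBasicsDemo
{
    // Inserts "-" before every a, b or c; $1 is the text captured by group 1.
    public static string InsertBefore(string input)
    {
        return Regex.Replace(input, @"([abc])", "-$1");
    }

    // Upper-cases "bar" only when it is NOT directly preceded by "foo"
    // (negative lookbehind).
    public static string ReplaceUnlessPrefixed(string input)
    {
        return Regex.Replace(input, @"(?<!foo)bar", "BAR");
    }

    public static void Main()
    {
        Console.WriteLine(InsertBefore("cab"));                 // -c-a-b
        Console.WriteLine(ReplaceUnlessPrefixed("foobar bar")); // foobar BAR
    }
}
```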

Which leads us to:

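// Requires: using System.Text.RegularExpressions;
private string ParseUrlInText(string text)
{
    // Prefix every WWW. URL that is not already preceded by HTTP://.
    // Regex matching is case-sensitive by default, so a lowercase www. is left alone.
    return Regex.Replace(text, @"(?<!HTTP://)(WWW\.[A-Za-z0-9_\.]+)",
        @"HTTP://$1");
}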
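A small usage sketch of this method, run against input modelled on the sample in the question. It assumes the already-prefixed URL is written without a space after HTTP://, and the class name `UrlPrefixDemo` is made up for the example. Because the replacement only touches the matched URLs, spaces and newlines are preserved.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPrefixDemo
{
    // Same pattern as above: match WWW. URLs not directly preceded by HTTP://.
    public static string ParseUrlInText(string text)
    {
        return Regex.Replace(text, @"(?<!HTTP://)(WWW\.[A-Za-z0-9_\.]+)", "HTTP://$1");
    }

    public static void Main()
    {
        // Assumed sample input: three bare URLs, one already prefixed,
        // and one line that must stay untouched.
        string input =
            "WWW.GOOGLE.CO.ZA\n" +
            "WWW.GOOGLE.CO.ZA WWW.GOOGLE.CO.ZA\n" +
            "HTTP://WWW.GOOGLE.CO.ZA\n" +
            "there are a lot of domains";

        Console.WriteLine(ParseUrlInText(input));
        // HTTP://WWW.GOOGLE.CO.ZA
        // HTTP://WWW.GOOGLE.CO.ZA HTTP://WWW.GOOGLE.CO.ZA
        // HTTP://WWW.GOOGLE.CO.ZA
        // there are a lot of domains
    }
}
```

One caveat: the character class also accepts lowercase letters after the WWW. prefix, so a mixed-case URL such as WWW.Google.com would be prefixed too. If only fully uppercase URLs should change, an extra check such as the question's IsAllUpper is still needed.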

Upvotes: 2

Nefarion

Reputation: 881

I'd go for the following, which:

1) doesn't handle the same element twice, and
2) replaces all instances of each element in one go.

// Requires: using System.Linq;
private string ParseUrlInText(string text)
{
    string currentText = text;
    // Distinct() ensures each unique word is processed only once.
    var workingText = currentText.Split(new[] { "\r\n", "\n", " ", "</br>" },
                          StringSplitOptions.RemoveEmptyEntries).Distinct();
    foreach (string word in workingText)
    {
        if (word.ToLower().StartsWith("www.") && IsAllUpper(word))
        {
            string thing = "HTTP://" + word;

            // Replace every occurrence that follows one of the delimiters.
            // Note: a URL at the very start of the text has no preceding
            // delimiter, so these calls do not match it.
            currentText = currentText.Replace("\r\n" + word, "\r\n" + thing)
                                     .Replace("\n" + word, "\n" + thing)
                                     .Replace(" " + word, " " + thing)
                                     .Replace("</br>" + word, "</br>" + thing);
        }
    }

    return currentText;
}

Upvotes: 0
