Adrian Smith

Reputation: 17553

URL auto-detection and highlighting in a block of text

The user may enter text, for example:

This is some text, visit www.mysite.com. Thanks & bye.

The URL should be found and turned into a link, for display in a website. All other characters should appear as-is.

I have been searching and googling for some time. I'm sure this sort of thing must already exist. My temptation is to program this myself but I'm sure this is more complex than it looks.

I'm sure there are other issues that I will encounter as soon as I attempt to program this myself. I don't think that a simple reg-exp is the way forward.

Is there any library which already does this, ideally for Java? (If it's in another technology maybe I can take a look at it and convert it to Java)

Upvotes: 0

Views: 1122

Answers (2)

Vala

Reputation: 5674

While you are right that this is a common problem, it is also one that isn't satisfactorily solved anywhere, nor can it be. URIs written in free text without markup can be ambiguous. See http://en.wikisource.org/wiki/1911_Encyclop%C3%A6dia_Britannica/Aga_Khan_I. for example: how would you know that the final '.' is part of the URI and not an end-of-sentence full stop? Have a look at "The Problem With URLs" for an introduction to the problem and quite an informative discussion in the comments. At the end of the day you can provide a best effort, such as matching protocols and looking for valid top-level domains (of which there are many more than you might think at first), but some things will always slip through the net.

As pseudo-code, I'd start with something along these lines:

process() {
    List<String> looksLikeUri = getMatches(1orMoreValidUriCharacters + "\\." + 1orMoreValidUriCharacters);
    removeUrisWithInvalidTopLevelDomains(looksLikeUri);
    trimCharactersUnlikelyToBeInUris(looksLikeUri);
    guessProtocolIfNotPresent(looksLikeUri);
}

removeUrisWithInvalidTopLevelDomains() // Use a list of valid ones or limit it to something like 1-6 characters.

trimCharactersUnlikelyToBeInUris() // Strip ,.:;? at the very end, '(' at the start, and ')' at the end unless the URI itself contains a '('.

guessProtocolIfNotPresent() // Usually http unless string starts with something obvious like "ftp" or already has a protocol.
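
A rough Java sketch of the pseudo-code above, to make the steps concrete. It is not a complete solution. The class name UriCandidateFinder, the five-entry TLD list and the "starts with ftp." rule are assumptions made for illustration, and trimming is done before the TLD check (rather than after it) so that a trailing full stop does not hide the TLD. It needs Java 9+ for Set.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UriCandidateFinder {
    // Assumption: a tiny illustrative list; a real one would be loaded from the IANA registry.
    private static final Set<String> KNOWN_TLDS = Set.of("com", "org", "net", "edu", "gov");

    // One or more non-whitespace characters, a dot, one or more non-whitespace characters.
    private static final Pattern CANDIDATE = Pattern.compile("\\S+\\.\\S+");

    // A scheme such as "http://" or "ftp://" at the start of a candidate.
    private static final Pattern SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*://");

    static List<String> process(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String candidate = trimCharactersUnlikelyToBeInUris(m.group());
            if (hasKnownTopLevelDomain(candidate)) {
                result.add(guessProtocolIfNotPresent(candidate));
            }
        }
        return result;
    }

    // Drops a leading '(' and trailing ,.:;? characters; a trailing ')' is
    // dropped only when the remaining text contains no '('.
    static String trimCharactersUnlikelyToBeInUris(String s) {
        int start = (!s.isEmpty() && s.charAt(0) == '(') ? 1 : 0;
        int end = s.length();
        while (end > start) {
            char last = s.charAt(end - 1);
            boolean punctuation = ",.:;?".indexOf(last) >= 0;
            boolean unmatchedParen = last == ')' && s.indexOf('(', start) < 0;
            if (!punctuation && !unmatchedParen) break;
            end--;
        }
        return s.substring(start, end);
    }

    // Looks only at the host part: scheme removed, then cut at the first / ? # or :.
    static boolean hasKnownTopLevelDomain(String candidate) {
        String rest = SCHEME.matcher(candidate).replaceFirst("");
        String host = rest.split("[/?#:]", 2)[0];
        int dot = host.lastIndexOf('.');
        if (dot <= 0) return false;
        return KNOWN_TLDS.contains(host.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Keeps an existing scheme, guesses ftp for "ftp." hosts, otherwise assumes http.
    static String guessProtocolIfNotPresent(String candidate) {
        if (SCHEME.matcher(candidate).find()) return candidate;
        if (candidate.toLowerCase(Locale.ROOT).startsWith("ftp.")) return "ftp://" + candidate;
        return "http://" + candidate;
    }
}
```

For the sample text from the question, UriCandidateFinder.process(...) returns a single entry, http://www.mysite.com. Things like e-mail addresses or a URI that really does end in a full stop will still be handled wrongly, which is the "slipping through the net" part.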

Upvotes: 1

MOleYArd

Reputation: 1268

It would probably be fully solvable if the URL always contained a protocol (such as HTTP). Because this is not the case, any "word" that contains a . character can potentially be a URL (for example mysite.com), and moreover you cannot be sure of the actual protocol (you can only assume one).

If you assume that the user will always be online, you can write a method that takes all potential URLs, checks whether each one exists and, if it does, produces an HTML link.

I wrote this code snippet:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class JavaURLHighlighter
{
    // Any whitespace-delimited "word" with a dot inside it is a potential URL.
    // One pattern covers candidates at the beginning, in the middle and at the
    // end of the text, and it does not consume the surrounding whitespace, so
    // two candidates separated by a single space are both found.
    private static final Pattern POTENTIAL_URL = Pattern.compile("\\S+\\.\\S+");

    private String urlString;
    private final List<String> matchesList = new ArrayList<String>();

    public String getUrlString() {
        return urlString;
    }

    public void setUrlString(String urlString) {
        this.urlString = urlString;
    }

    public void getConvertedMatches()
    {
        matchesList.clear();
        Matcher matcher = POTENTIAL_URL.matcher(urlString);
        while (matcher.find())
        {
            String match = matcher.group();
            // Assume HTTP when the candidate carries no protocol.
            if (!match.startsWith("http://") && !match.startsWith("https://")) match = "http://" + match;
            // Drop a sentence-ending full stop.
            if (match.endsWith(".")) match = match.substring(0, match.length() - 1);
            if (urlExists(match)) matchesList.add(match);
        }

        for (String match : matchesList) System.out.println(match);
    }

    public static boolean urlExists(String urlAddress)
    {
        HttpURLConnection connection = null;
        try
        {
            connection = (HttpURLConnection) new URL(urlAddress).openConnection();
            // Per-connection setting; the static setFollowRedirects would affect the whole JVM.
            connection.setInstanceFollowRedirects(false);
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            // Count redirects (3xx) as "exists" too, since many sites redirect http to https.
            int code = connection.getResponseCode();
            return code >= 200 && code < 400;
        }
        catch (Exception e) { return false; } // malformed URL, unknown host, timeout, ...
        finally { if (connection != null) connection.disconnect(); }
    }

    public static void main(String[] args)
    {
        JavaURLHighlighter hg = new JavaURLHighlighter();

        hg.setUrlString("This is some text, visit www.mysite.com. Thanks & bye.");
        hg.getConvertedMatches();

        hg.setUrlString("This is some text, visit www.nonexistingmysite.com. Thanks & bye.");
        hg.getConvertedMatches();
    }
}

It's not an actual solution to your problem and I wrote it quickly, so it might not be completely correct, but it should guide you a bit. Here I just print the matches. Have a look at "Java equivalent to PHP's preg_replace_callback" for a regexp replacing function with which you could wrap all modified matches in <a href> tags. With the information provided you should be able to write what you want, though possibly without 100% reliable detection.
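
To give an idea of the replacing step, here is a minimal sketch that walks over the matches by hand and builds the HTML, escaping everything else so that characters such as & appear as-is in the browser. The class name TextLinkifier, the accept predicate (where an existence check could be plugged in) and the set of stripped trailing characters are my assumptions for illustration, not a finished implementation.

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextLinkifier {
    // Assumption: a candidate is any whitespace-free run with a dot inside it.
    private static final Pattern CANDIDATE = Pattern.compile("\\S+\\.\\S+");

    // Escapes the characters that are special in HTML text and attribute values.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Removes sentence punctuation from the end of a candidate.
    static String stripTrailingPunctuation(String s) {
        int end = s.length();
        while (end > 0 && ".,;:?!".indexOf(s.charAt(end - 1)) >= 0) end--;
        return s.substring(0, end);
    }

    // Returns the text as HTML; accepted candidates become links, the rest is escaped.
    static String linkify(String text, Predicate<String> accept) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the text already copied to the output
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            String candidate = stripTrailingPunctuation(m.group());
            if (candidate.indexOf('.') <= 0) continue; // nothing link-like left
            // Assume http when the candidate carries no protocol.
            String href = candidate.matches("(?i)https?://.*") ? candidate : "http://" + candidate;
            if (!accept.test(href)) continue; // leave rejected candidates as plain text
            out.append(escapeHtml(text.substring(last, m.start())));
            out.append("<a href=\"").append(escapeHtml(href)).append("\">")
               .append(escapeHtml(candidate)).append("</a>");
            last = m.start() + candidate.length();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }
}
```

Calling TextLinkifier.linkify(text, url -> true) links every candidate; passing a check such as the urlExists method above as the predicate would keep only URLs that respond, at the cost of one network request per candidate.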

Upvotes: 0
