Sinista

Reputation: 457

Java Regex URL Matching

I have a simple regular expression that matches URLs, and it works fine. However, I'd like to refine it so that it excludes any URL containing a certain word.

My pattern: (http:[A-z0-9./~%]+)

For example, it matches:

http://maps.google.com/maps
http://www.google.com/flights/gwsredirect
http://slav0nic.org.ua/static/books/python/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/doc/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/

Given the list of URLs above matched by my pattern, I'd like to refine the pattern to exclude any URL containing a given word, for example google.

I tried using non-capturing groups but was unsuccessful; maybe I'm missing something.

ADDITIONAL INFORMATION

Maybe my description wasn't clear.

I have a file of data grabbed from a URL, and I use the pattern above to extract the list of links shown. As you can see, the pattern returns every link, which is more than I want. I'd like to refine it so that it does not return links containing a certain word, e.g. google.

So after I parse the data, instead of returning the list of links above, it would return the following:

http://slav0nic.org.ua/static/books/python/
http://www.python.org/ftp/python/doc/
http://www.python.org/ftp/python/

All help is appreciated, thank you!

Upvotes: 3

Views: 5537

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77454

Try this:

(http:(?![^"\s]*google)[^"\s]+)["\s]

The key difference from the solutions posted earlier is that I control how far the lookahead searches: it stops at the next quote or whitespace character, so only the current URL is checked for google.
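To show how this pattern could be applied to a block of text rather than a single URL, here is a minimal sketch. The class name, the extract helper, and the sample HTML are made up for illustration; only the pattern comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedLookaheadDemo {
    // Pattern from this answer. The lookahead only scans up to the next
    // double quote or whitespace character, i.e. to the end of the current URL.
    private static final Pattern LINK =
        Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)[\"\\s]");

    // Hypothetical helper: collects group 1 of every match in the text.
    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(m.group(1)); // the URL without the trailing delimiter
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input with two links.
        String html = "<a href=\"http://maps.google.com/maps\">a</a> "
            + "<a href=\"http://www.python.org/ftp/python/\">b</a>\n";
        System.out.println(extract(html));
    }
}
```

Note that the pattern requires a quote or whitespace character after the URL, so a link at the very end of the input with nothing after it would not be matched.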

Upvotes: 2

mana

Reputation: 6547

Try this:

(http:(?!.*google).*)

Source: similar questions

EDIT: (this works, tested it)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFilterTest {

    public static void main( String[] args ) {

        // Negative lookahead: after "http:", reject the match if "google"
        // appears anywhere in the rest of the string.
        final Pattern p = Pattern.compile( "(http:(?!.*google).*)" );
        final String[] in = new String[]{
            "http://maps.google.com/maps",
            "http://www.google.com/flights/gwsredirect",
            "http://slav0nic.org.ua/static/books/python/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/doc/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/",
        };

        // Print each URL followed by whether the pattern accepts it.
        for ( final String s : in ) {
            final Matcher m = p.matcher( s );
            System.out.print( s );
            if ( m.find() ) {
                System.out.println( " true" );
            } else {
                System.out.println( " false" );
            }
        }
    }
}

OUTPUT:

http://maps.google.com/maps false
http://www.google.com/flights/gwsredirect false
http://slav0nic.org.ua/static/books/python/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/doc/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/ true
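One caveat, as a hedged sketch: the test above feeds the pattern one URL per string. Since . matches up to the end of the line, both the lookahead and the capture cover everything after "http:" on that line. The demo below compares the pattern with a whitespace-bounded variant. The BOUNDED pattern, the class name, and the sample line are assumptions for illustration, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadScopeDemo {
    // Pattern from this answer: lookahead and capture both run to the end of the line.
    static final Pattern UNBOUNDED = Pattern.compile("(http:(?!.*google).*)");
    // Assumed variant: treats any run of non-whitespace characters as the URL.
    static final Pattern BOUNDED = Pattern.compile("(http:(?!\\S*google)\\S+)");

    // Returns group 1 of the first match, or null if nothing matches.
    static String firstMatch(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up line holding two links separated by a space.
        String line = "http://www.python.org/ftp/python/ http://maps.google.com/maps";
        // The later google link makes the unbounded lookahead reject the first link too.
        System.out.println(firstMatch(UNBOUNDED, line)); // null
        // The bounded lookahead only inspects the first link.
        System.out.println(firstMatch(BOUNDED, line));   // http://www.python.org/ftp/python/
    }
}
```

If the input really is one URL per string, as in the test above, the two patterns accept and reject the same URLs.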

Upvotes: 1

fge

Reputation: 121702

Modify your regex to capture the hostname and use .contains():

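import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TestMatch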
{
    private static final List<String> urls = Arrays.asList(
        "http://maps.google.com/maps",
        "http://www.google.com/flights/gwsredirect",
        "http://slav0nic.org.ua/static/books/python/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/doc/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/"
    );

    private static final Pattern p
        = Pattern.compile("^http://([^/]+)/");

    private static final int TRIES = 50000;

    public static void main(final String... args)
    {
        for (final String url: urls)
            System.out.printf("%s: %b\n", url, regexIsOK(url));

        long start, end;

        start = System.currentTimeMillis();
        for (int i = 0; i < TRIES; i++)
            for (final String url: urls)
                regexIsOK(url);
        end = System.currentTimeMillis();

        System.out.println("Time taken: " + (end - start) + " ms");
        System.exit(0);
    }

    private static boolean regexIsOK(final String url)
    {
        final Matcher m = p.matcher(url);

        return m.find() && !m.group(1).contains("google");
    }
}

Sample output:

http://maps.google.com/maps: false
http://www.google.com/flights/gwsredirect: false
http://slav0nic.org.ua/static/books/python/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/doc/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/: true
Time taken: 258 ms
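For the case described in the question, where links are pulled out of a larger block of text, the same idea might be sketched as a two-step "extract, then filter". The LINK pattern below is the question's pattern with [A-z] narrowed to [A-Za-z] plus '_' and '-', which is my assumption; the class name, the linksWithout helper, and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractThenFilter {
    // Candidate links: adapted from the question's pattern (an assumption, see above).
    private static final Pattern LINK = Pattern.compile("http:[A-Za-z0-9./~%_-]+");
    // Hostname capture, as in this answer.
    private static final Pattern HOST = Pattern.compile("^http://([^/]+)/");

    // Hypothetical helper: returns every link whose hostname does not contain the word.
    static List<String> linksWithout(String text, String word) {
        List<String> kept = new ArrayList<>();
        Matcher links = LINK.matcher(text);
        while (links.find()) {
            String url = links.group();
            Matcher host = HOST.matcher(url);
            if (host.find() && !host.group(1).contains(word)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample text containing two links.
        String text = "see http://maps.google.com/maps and "
            + "http://www.python.org/ftp/python/doc/ here";
        System.out.println(linksWithout(text, "google"));
    }
}
```

As with the code above, only the hostname is checked, so a URL with the word in its path but not in its hostname would still be kept.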

Upvotes: 0
