Reputation: 457
I have a simple regular expression that matches URLs, and it works fine. However, I'd like to refine it so that it excludes any URL containing a certain word.
My pattern: (http:[A-z0-9./~%]+)
For example, it currently matches:
http://maps.google.com/maps
http://www.google.com/flights/gwsredirect
http://slav0nic.org.ua/static/books/python/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/doc/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/
Given the list of URLs above, all matched by my pattern, I'd like to refine the pattern to exclude URLs containing a particular word, for example google.
I tried using non-capturing groups but was unsuccessful; maybe I'm missing something.
Maybe my description wasn't clear.
I have a file of data fetched from a URL, and I use the pattern above to extract the list of links. As you can see, the pattern returns every link, which is more than I want. I'd like to refine it so that it does not return links containing a certain word, e.g. google.
So after parsing the data, instead of the list of links above, it would return the following:
http://slav0nic.org.ua/static/books/python/
http://www.python.org/ftp/python/doc/
http://www.python.org/ftp/python/
Any help is appreciated, thank you!
Upvotes: 3
Views: 5537
Reputation: 77454
Try this:
(http:(?![^"\s]*google)[^"\s]+)["\s]
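The key difference from the solutions posted earlier is that the negative lookahead is bounded: it only scans up to the next double quote or whitespace character (the end of the current URL) rather than the rest of the line.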
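To see the bounded lookahead in action, here is a minimal sketch that applies this pattern to a snippet of HTML-like text. The class name, the `extract` helper, and the sample input are made up for illustration; only the pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedLookaheadDemo {
    // Pattern from the answer: the lookahead [^"\s]*google can only look
    // as far as the next double quote or whitespace character.
    private static final Pattern URL_WITHOUT_GOOGLE =
            Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)[\"\\s]");

    // Hypothetical helper: collects group 1 of every match in the text.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_WITHOUT_GOOGLE.matcher(text);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample input: two links on the same line.
        String sample = "<a href=\"http://maps.google.com/maps\">x</a> "
                + "<a href=\"http://www.python.org/ftp/python/\">y</a>";
        System.out.println(extract(sample));
        // prints [http://www.python.org/ftp/python/]
    }
}
```

Note that the pattern requires a double quote or whitespace character after the URL, so a URL sitting at the very end of the input with nothing after it would not be matched.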
Upvotes: 2
Reputation: 6547
Try this:
(http:(?!.*google).*)
Source: similar questions
EDIT: (this works, tested it)
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    public static void main( String[] args ) {
        // Negative lookahead: reject the match if "google" appears anywhere after "http:"
        final Pattern p = Pattern.compile( "(http:(?!.*google).*)" );
        final String[] in = new String[]{
            "http://maps.google.com/maps",
            "http://www.google.com/flights/gwsredirect",
            "http://slav0nic.org.ua/static/books/python/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/doc/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/",
        };
        for ( final String s : in ) {
            final Matcher m = p.matcher( s );
            // Print each URL followed by whether the pattern matched it
            System.out.println( s + " " + m.find() );
        }
    }
}
OUTPUT:
http://maps.google.com/maps false
http://www.google.com/flights/gwsredirect false
http://slav0nic.org.ua/static/books/python/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/doc/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/ true
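The test above feeds the pattern one URL per string. As a rough sketch of what happens when a string holds more than a single URL, the following applies the same pattern to a few made-up inputs (the class name, the `firstMatch` helper, and the sample strings are assumptions for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnboundedLookaheadDemo {
    // Same pattern as above: both the lookahead and the capture use ".*",
    // so they run to the end of the line.
    private static final Pattern P = Pattern.compile("(http:(?!.*google).*)");

    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstMatch(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // One URL per string: behaves as in the output above.
        System.out.println(firstMatch("http://www.python.org/ftp/python/"));
        // "google" later on the same line also rejects the earlier URL.
        System.out.println(firstMatch(
                "http://www.python.org/ftp/python/ http://maps.google.com/maps"));
        // Trailing text on the line is captured together with the URL.
        System.out.println(firstMatch("http://www.python.org/ftp/python/ (docs)"));
    }
}
```

With these inputs the first call returns the URL, the second returns null, and the third returns the URL plus the trailing " (docs)". So this pattern seems best suited to input that has already been split into one URL per string.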
Upvotes: 1
Reputation: 121702
Modify your regex to capture the hostname and use .contains():
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TestMatch
{
    private static final List<String> urls = Arrays.asList(
        "http://maps.google.com/maps",
        "http://www.google.com/flights/gwsredirect",
        "http://slav0nic.org.ua/static/books/python/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/doc/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/"
    );

    // Group 1 captures the hostname: everything between "http://" and the next "/"
    private static final Pattern p
        = Pattern.compile("^http://([^/]+)/");

    private static final int TRIES = 50000;

    public static void main(final String... args)
    {
        for (final String url: urls)
            System.out.printf("%s: %b\n", url, regexIsOK(url));

        // Crude timing loop: run the check TRIES times over the whole list
        long start, end;
        start = System.currentTimeMillis();
        for (int i = 0; i < TRIES; i++)
            for (final String url: urls)
                regexIsOK(url);
        end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + " ms");
        System.exit(0);
    }

    // True if the URL matches and its hostname does not contain "google"
    private static boolean regexIsOK(final String url)
    {
        final Matcher m = p.matcher(url);
        return m.find() && !m.group(1).contains("google");
    }
}
Sample output:
http://maps.google.com/maps: false
http://www.google.com/flights/gwsredirect: false
http://slav0nic.org.ua/static/books/python/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/doc/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/: true
Time taken: 258 ms
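Since the question is about pulling links out of a larger blob of data, the same idea can probably be adapted to extraction rather than validation. The sketch below is one possible variant, not the code from this answer: the un-anchored pattern, the class name, the `extractExcluding` helper, and the sample text are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HostFilterDemo {
    // Assumed variant of the pattern: not anchored, so it can find URLs
    // anywhere in the text. Group 1 is the hostname; the whole match is the URL.
    private static final Pattern URL =
            Pattern.compile("http://([^/\\s\"]+)[^\\s\"]*");

    // Hypothetical helper: returns every URL whose hostname lacks the given word.
    static List<String> extractExcluding(String text, String word) {
        List<String> kept = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            if (!m.group(1).contains(word)) {
                kept.add(m.group());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample input: several URLs separated by spaces and a newline.
        String text = "http://maps.google.com/maps http://www.python.org/ftp/python/doc/\n"
                + "http://webcache.googleusercontent.com/search "
                + "http://slav0nic.org.ua/static/books/python/";
        System.out.println(extractExcluding(text, "google"));
        // prints [http://www.python.org/ftp/python/doc/, http://slav0nic.org.ua/static/books/python/]
    }
}
```

Like the original, this only inspects the hostname, so a URL with "google" in its path would still be kept. Checking m.group() instead of m.group(1) would cover the whole URL.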
Upvotes: 0