lonesome

Reputation: 2552

Enhancing regex to match more URLs

Considering this regex:

  static String AdrPattern="(?:http://www\\.([^/&]+)\\.com/|(?!^)\\G)/?([^/]+)";

I have two small questions:

  1. How can it be made to match URLs that consist of only the domain name, with no further path segment (such as https://stackoverflow.com)?
  2. How can it be made to match URLs with other domain extensions?

P.S. The regex is taken from here and works fine, apart from these two shortcomings.

EDIT

With the code below, the regex from the answer to this post skips the remaining segments and prints only the domain name:

    static String AdrPattern = "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net|gov|org)(?:/([^\\s/]+))?)";
    static Pattern WebUrlPattern = Pattern.compile(AdrPattern);

    // line, prefix and fop (the output stream) are defined elsewhere in the program
    Matcher WebUrlMatcher = WebUrlPattern.matcher(line);

    int cnt = 0;
    while (WebUrlMatcher.find()) {
        // cnt is incremented after the first URL match and never reset,
        // so every later match (including the \G continuations) is skipped
        if (cnt == 0) {
            // Group 1: a path segment that continues the previous match
            String extractedPath = WebUrlMatcher.group(1);
            if (extractedPath != null) {
                fop.write(prefix.toLowerCase().getBytes());
                fop.write(System.getProperty("line.separator").getBytes());
                fop.write(extractedPath.toLowerCase().getBytes());
                fop.write(System.getProperty("line.separator").getBytes());
            }

            // Group 2: the domain; group 3: the optional first path segment
            String extractedPart = WebUrlMatcher.group(2);
            String extracted2 = WebUrlMatcher.group(3);
            if (extractedPart != null) {
                fop.write(extractedPart.toLowerCase().getBytes());
                fop.write(System.getProperty("line.separator").getBytes());

                if (extracted2 != null) {
                    fop.write(extracted2.toLowerCase().getBytes());
                    fop.write(System.getProperty("line.separator").getBytes());
                }

                cnt = cnt + 1;
            }
        }
    }

Upvotes: 0

Views: 72

Answers (1)

user557597


Here is one way: a slight modification of the current regex.
After each match, test which capture groups are set.

 "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)"

 (?:
      (?! \A )                      # Not BOS
      \G                            # Start from last match
      (?:
           /  
           ( [^\s/]+ )                   # (1), Required Next Segment path (or fail)
      )
   |                              # or,
      http://www\.                  # New match
      ( [^\s/&]+ )                  # (2), Domain
      \.
      (?: com | net )               # Extension
      (?:
           /  
           ( [^\s/]+ )                   # (3), Optional First Segment path
      )?
 )
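
Testing the capture groups could look roughly like the sketch below. It uses the regex above unchanged; the class name `UrlGroups`, the helper `parts` and the `domain:` / `segment:` labels are made up for illustration. A non-null group 2 means a new URL started (group 3 may carry its first segment), and a non-null group 1 means the match continued the previous URL with its next segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlGroups {
    // The regex from this answer, unchanged.
    static final Pattern URL = Pattern.compile(
        "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)");

    // Hypothetical helper: returns one labelled entry per captured part, in match order.
    static List<String> parts(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(input);
        while (m.find()) {
            if (m.group(2) != null) {
                // New URL: group 2 is the domain, group 3 the optional first segment.
                out.add("domain:" + m.group(2));
                if (m.group(3) != null) {
                    out.add("segment:" + m.group(3));
                }
            } else if (m.group(1) != null) {
                // Continuation via \G: group 1 is the next path segment.
                out.add("segment:" + m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "http://www.asfdasdf.net/\n"
                     + "http://www.asfdasdf.net/first\n"
                     + "http://www.asfdasdf.net/first/second\n";
        parts(input).forEach(System.out::println);
    }
}
```

Every match is handled here, so the segments after the first one are not skipped. There is no counter that stops processing after the first match.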

Test captures:

Input:

http://www.asfdasdf.net/  
http://www.asfdasdf.net/first  
http://www.asfdasdf.net/first/second  

Output:

 **  Grp 0 -  ( pos 0 , len 23 ) 
http://www.asfdasdf.net  
 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 11 , len 8 ) 
asfdasdf  
 **  Grp 3 -  NULL 

-------------

 **  Grp 0 -  ( pos 28 , len 29 ) 
http://www.asfdasdf.net/first  
 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 39 , len 8 ) 
asfdasdf  
 **  Grp 3 -  ( pos 52 , len 5 ) 
first  

-------------

 **  Grp 0 -  ( pos 61 , len 29 ) 
http://www.asfdasdf.net/first  
 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 72 , len 8 ) 
asfdasdf  
 **  Grp 3 -  ( pos 85 , len 5 ) 
first  

-------------

 **  Grp 0 -  ( pos 90 , len 7 ) 
/second  
 **  Grp 1 -  ( pos 91 , len 6 ) 
second  
 **  Grp 2 -  NULL 
 **  Grp 3 -  NULL 
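
The pattern as written still requires `http://www.`, so a URL such as https://stackoverflow.com from the question would not match. One possible variation, which was not part of the test run above and is only a sketch, makes the `s` and the `www.` prefix optional and extends the extension list. The class name `UrlVariant` and the helper `parts` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlVariant {
    // Assumed variation: "https?" and "(?:www\\.)?" are additions to the regex above.
    // More extensions can be appended to the (?:com|net|gov|org) alternation as needed.
    static final Pattern URL = Pattern.compile(
        "(?:(?!\\A)\\G(?:/([^\\s/]+))"
        + "|https?://(?:www\\.)?([^\\s/&]+)\\.(?:com|net|gov|org)(?:/([^\\s/]+))?)");

    // Hypothetical helper: returns the domain followed by each path segment, in match order.
    static List<String> parts(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(input);
        while (m.find()) {
            if (m.group(2) != null) {
                out.add(m.group(2));              // domain
                if (m.group(3) != null) {
                    out.add(m.group(3));          // first segment, if any
                }
            } else if (m.group(1) != null) {
                out.add(m.group(1));              // following segment
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parts("https://stackoverflow.com"));
        System.out.println(parts("http://www.example.org/a/b"));
    }
}
```

With these inputs the first call yields only the domain (`stackoverflow`), and the second yields `example`, `a` and `b`. Note that a multi-label host such as `a.b.com` would be captured as `a.b`, since the domain group takes everything up to the last listed extension.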

Upvotes: 1
