Considering this regex:
static String AdrPattern="(?:http://www\\.([^/&]+)\\.com/|(?!^)\\G)/?([^/]+)";
I have two small questions:
P.S.: The regex is taken from here and works fine, but these two shortcomings should be fixed.
EDIT
With the code below, the regex from the answer to this post skips the later path segments and prints only the domain name:
static String AdrPattern = "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net|gov|org)(?:/([^\\s/]+))?)";
static Pattern WebUrlPattern = Pattern.compile(AdrPattern);

Matcher WebUrlMatcher = WebUrlPattern.matcher(line);
int cnt = 0;
while (WebUrlMatcher.find()) {
    // Matches are only processed while cnt is still 0
    if (cnt == 0) {
        // Group 1: a later path segment (continuation match via \G)
        String extractedPath = WebUrlMatcher.group(1);
        if (extractedPath != null) {
            fop.write(prefix.toLowerCase().getBytes());
            fop.write(System.getProperty("line.separator").getBytes());
            fop.write(extractedPath.toLowerCase().getBytes());
            fop.write(System.getProperty("line.separator").getBytes());
        }
        // Group 2: domain; group 3: optional first path segment
        String extractedPart = WebUrlMatcher.group(2);
        String extracted2 = WebUrlMatcher.group(3);
        if (extractedPart != null) {
            fop.write(extractedPart.toLowerCase().getBytes());
            fop.write(System.getProperty("line.separator").getBytes());
            if (extracted2 != null) {
                fop.write(extracted2.toLowerCase().getBytes());
                fop.write(System.getProperty("line.separator").getBytes());
            }
            cnt = cnt + 1;
        }
    }
}
}
Upvotes: 0
Views: 72
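Here is one way, a slight modification of the current regex.
Just test the capture groups on each match: if group 2 is set, a new URL started (group 3 holds its optional first path segment); otherwise group 1 holds the next path segment.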
"(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)"
(?:
     (?! \A )                 # Not BOS
     \G                       # Start from last match
     (?:
          /
          ( [^\s/]+ )         # (1), Required Next Segment path (or fail)
     )
  |                           # or,
     http://www\.             # New match
     ( [^\s/&]+ )             # (2), Domain
     \.
     (?: com | net )          # Extension
     (?:
          /
          ( [^\s/]+ )         # (3), Optional First Segment path
     )?
)
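As a side note, the expanded form can be compiled in Java as written if the `Pattern.COMMENTS` flag is used, since that flag makes the engine ignore whitespace and `#` comments inside the pattern. The following is only a sketch of that idea; the class name `VerbosePattern` and the helper `firstMatchGroups` are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VerbosePattern {
    // Same regex as the one-line form, spread over several lines.
    // With Pattern.COMMENTS, whitespace and "# ..." comments are ignored.
    static final Pattern VERBOSE = Pattern.compile(
          "(?:\n"
        + "    (?! \\A )          # not at beginning of input\n"
        + "    \\G                # continue from end of last match\n"
        + "    (?:\n"
        + "        /\n"
        + "        ( [^\\s/]+ )   # (1) required next path segment\n"
        + "    )\n"
        + "  |\n"
        + "    http://www\\.      # start of a new URL\n"
        + "    ( [^\\s/&]+ )      # (2) domain\n"
        + "    \\.\n"
        + "    (?: com | net )    # extension\n"
        + "    (?:\n"
        + "        /\n"
        + "        ( [^\\s/]+ )   # (3) optional first path segment\n"
        + "    )?\n"
        + ")",
        Pattern.COMMENTS);

    // Hypothetical helper: groups 1..3 of the first match, or null if none.
    static String[] firstMatchGroups(String text) {
        Matcher m = VERBOSE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            firstMatchGroups("http://www.asfdasdf.net/first")));
    }
}
```

Keeping the comments in the compiled pattern makes it easier to see which group number belongs to which part when testing the groups later.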
Test captures:
Input:
http://www.asfdasdf.net/
http://www.asfdasdf.net/first
http://www.asfdasdf.net/first/second
Output:
** Grp 0 - ( pos 0 , len 23 )
http://www.asfdasdf.net
** Grp 1 - NULL
** Grp 2 - ( pos 11 , len 8 )
asfdasdf
** Grp 3 - NULL
-------------
** Grp 0 - ( pos 28 , len 29 )
http://www.asfdasdf.net/first
** Grp 1 - NULL
** Grp 2 - ( pos 39 , len 8 )
asfdasdf
** Grp 3 - ( pos 52 , len 5 )
first
-------------
** Grp 0 - ( pos 61 , len 29 )
http://www.asfdasdf.net/first
** Grp 1 - NULL
** Grp 2 - ( pos 72 , len 8 )
asfdasdf
** Grp 3 - ( pos 85 , len 5 )
first
-------------
** Grp 0 - ( pos 90 , len 7 )
/second
** Grp 1 - ( pos 91 , len 6 )
second
** Grp 2 - NULL
** Grp 3 - NULL
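To show how the group tests could look in Java, here is a minimal sketch. It assumes the parts should simply be collected into a list instead of being written to a file; the class name `UrlSegments` and the method `extract` are invented for this example. Note there is no counter guarding the loop body, so the continuation matches (group 1) are processed too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlSegments {
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:(?!\\A)\\G(?:/([^\\s/]+))|http://www\\.([^\\s/&]+)\\.(?:com|net)(?:/([^\\s/]+))?)");

    // Hypothetical helper: returns the domain and every path segment, lowercased.
    static List<String> extract(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            if (m.group(2) != null) {
                // New URL: group 2 is the domain, group 3 the optional first segment.
                parts.add(m.group(2).toLowerCase());
                if (m.group(3) != null) {
                    parts.add(m.group(3).toLowerCase());
                }
            } else if (m.group(1) != null) {
                // Continuation match via \G: group 1 is the next segment.
                parts.add(m.group(1).toLowerCase());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [asfdasdf, first, second]
        System.out.println(extract("http://www.asfdasdf.net/first/second"));
    }
}
```

Every call to `find()` yields either a new URL or one further segment, so each match has to be handled; stopping after the first match leaves only the domain and the first segment.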
Upvotes: 1