Reputation: 598
I have a list of URLs in a large file (20 MB), and I have a set of keywords. If any of the keywords matches a URL, I want to extract that URL.
Example: keyword = "contact", URL: http://www.365media.com/offices-and-contact.html
I need a regular expression to match the keywords with my list of URLs.
My Java code:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileRead {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner in = new Scanner(new File("D:\\Log\\Links.txt"));
        String input;
        String[] reg = new String[]{".*About.*", ".*Available.*", ".*Author.*", ".*Blog.*", ".*Business.*",
                ".*Career.*", ".*category.*", ".*City.*", ".*Company.*", ".*Contain.*", ".*Contact.*", ".*Download.*",
                ".*Email.*"};
        // Read the file one URL per line
        while (in.hasNextLine()) {
            input = in.nextLine();
            //for(String s:reg)
            patternFind(input, ".*email.*");
        }
        in.close();
    }

    // Prints the input line if the given regex is found in it
    public static void patternFind(String input, String reg) {
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(input);
        while (m.find())
            System.out.println(m.group());
    }
}
Upvotes: 0
Views: 1681
Reputation: 56809
I'm going to give a fairly general solution. I think you should be able to adapt the idea to your code.
Suppose you have a list of bare keywords, either read from a file into a String[] or hard-coded in a String[], for example:
String keywords[] = {"about", "available", "email"};
Pass every keyword through Pattern.quote() to make sure it is treated as a literal string. Then join the keywords with the bar character | as the separator (OR), and surround everything with parentheses (). If you are sure that the keywords contain no regex metacharacters, you can skip the Pattern.quote() step, or look at the keywords yourself and write the regex by hand without the \Q and \E quoting. The end result will look like this:
(\Qabout\E|\Qavailable\E|\Qemail\E)
Add .* to both ends so that the pattern matches the rest of the URL, and put (?i) at the beginning to enable case-insensitive matching:
(?i).*(\Qabout\E|\Qavailable\E|\Qemail\E).*
Then you can compile the Pattern and call matcher(inputString).matches() on each line of input to check whether the URL contains a keyword.
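The steps above can be sketched as follows. This is only a minimal illustration: the class name KeywordUrlFilter, the helper names buildPattern and filter, and the example.com URLs are made up for the example and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class KeywordUrlFilter {

    // Builds (?i).*(\Qk1\E|\Qk2\E|...).* from a list of bare keywords.
    static Pattern buildPattern(String[] keywords) {
        StringBuilder alternation = new StringBuilder();
        for (String keyword : keywords) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // quote() wraps the keyword in \Q...\E so it is matched literally
            alternation.append(Pattern.quote(keyword));
        }
        return Pattern.compile("(?i).*(" + alternation + ").*");
    }

    // Returns the URLs that contain at least one of the keywords.
    static List<String> filter(List<String> urls, Pattern pattern) {
        List<String> matched = new ArrayList<>();
        for (String url : urls) {
            if (pattern.matcher(url).matches()) {
                matched.add(url);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] keywords = {"about", "available", "email"};
        Pattern pattern = buildPattern(keywords); // compiled once, reused for every line

        // Hypothetical sample input; in practice these would be the lines of the file.
        List<String> urls = Arrays.asList(
                "http://www.example.com/About-Us.html",
                "http://www.example.com/products.html");

        for (String url : filter(urls, pattern)) {
            System.out.println(url); // only the About-Us URL is printed
        }
    }
}
```

The pattern is compiled once and reused, so the cost of building the alternation is not paid again for every line of the file.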
If a keyword is very common in URLs, such as "com", "net", or "www", and you want to make the search more fine-grained, more tweaking must be done.
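One possible tweak, which is my own assumption and not something the answer above spells out, is to search only the path part of each URL so that "www" or "com" in the host name does not count as a hit. The sketch below uses java.net.URI for the split; the class name PathOnlyMatch, the method pathMatches, and the example.com URL are hypothetical, and lines that cannot be parsed as a URI are simply treated as non-matching.

```java
import java.net.URI;
import java.util.regex.Pattern;

final class PathOnlyMatch {

    // Returns true if the keyword pattern is found in the path part of the URL.
    static boolean pathMatches(String url, Pattern keywordPattern) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // assumption: unparsable lines are skipped
        }
        // getPath() returns null for opaque URIs such as mailto: addresses
        return path != null && keywordPattern.matcher(path).find();
    }

    public static void main(String[] args) {
        Pattern keywords = Pattern.compile("(com|contact)", Pattern.CASE_INSENSITIVE);

        // "contact" occurs in the path, so this URL is reported.
        System.out.println(pathMatches(
                "http://www.365media.com/offices-and-contact.html", keywords));

        // "com" occurs only in the host name, so this URL is not reported.
        System.out.println(pathMatches(
                "http://www.example.com/index.html", keywords));
    }
}
```

Whether this is the right refinement depends on the data; word boundaries or a stop list of overly common keywords would be other options.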
Upvotes: 0
Reputation: 26799
Why can't you do this:
For every line (URL) in the file, check whether any of your patterns matches the URL.
The code is pretty obvious.
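For completeness, here is a rough sketch of that loop. The class name LineByLineCheck and the method matchesAny are made up for the example, the second sample URL is hypothetical, and a StringReader stands in for the real file so that the snippet runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LineByLineCheck {

    // Returns true as soon as one of the patterns is found in the URL.
    static boolean matchesAny(String url, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            // find() searches anywhere in the line, so no leading/trailing .* is needed
            if (pattern.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Patterns are compiled once, outside the loop over lines.
        List<Pattern> patterns = new ArrayList<>();
        patterns.add(Pattern.compile("contact", Pattern.CASE_INSENSITIVE));
        patterns.add(Pattern.compile("email", Pattern.CASE_INSENSITIVE));

        // Stand-in for the real file; a FileReader could be wrapped the same way.
        String sample = "http://www.365media.com/offices-and-contact.html\n"
                + "http://www.365media.com/index.html\n";

        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (matchesAny(line, patterns)) {
                    System.out.println(line);
                }
            }
        }
    }
}
```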
Upvotes: 1
Reputation: 848
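If you only want to check whether any keyword occurs in the current line, you can simply use
for (String s : reg) {
    if (input.contains(s)) {
        // do something
    }
}
instead of patternFind(input,".*email.*");. Note that contains() compares literal text and is case-sensitive, so reg must then hold the bare keywords (for example "About") rather than patterns such as ".*About.*".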
Anyways, a regular expression equivalent to match any of the words would be:
.*(About|Available|Author|And|So|On...).*
I'm not sure which one is faster. String.contains() is simpler, while a precompiled Pattern could perform better when it is applied many times, as is the case here.
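Because String.contains() is case-sensitive, a keyword such as "Contact" would not be found in the sample URL from the question, where it appears in lower case. A minimal sketch of a case-insensitive variant of the loop is shown below; the class name ContainsCheck and the method containsAny are hypothetical, and the keywords are assumed to be bare words rather than regular expressions.

```java
import java.util.Locale;

final class ContainsCheck {

    // Case-insensitive version of the contains() loop; keywords are bare words.
    static boolean containsAny(String url, String[] keywords) {
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (lowerUrl.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] keywords = {"About", "Contact", "Email"};
        String url = "http://www.365media.com/offices-and-contact.html";

        // Plain contains() misses the keyword because of the capital C.
        System.out.println(url.contains("Contact"));

        // The lower-cased comparison finds it.
        System.out.println(containsAny(url, keywords));
    }
}
```

If the keyword list is fixed, lower-casing the keywords once up front would avoid repeating that work for every line.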
Upvotes: 1