Kishore Karunakaran

Reputation: 598

Regular expression keyword match in URL

I have a list of URLs in a large file (20 MB), and I have a set of keywords. If any of the keywords matches a URL, I want to extract that URL.

Example: keyword = "contact", URL: http://www.365media.com/offices-and-contact.html

I need a regular expression to match the keywords with my list of URLs.

My Java code:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileRead {

    public static void main(String[] args) throws FileNotFoundException
    {
        Scanner in = new Scanner(new File("D:\\Log\\Links.txt"));
        String input;
        // Keyword patterns (not used yet; the loop below is commented out)
        String[] reg = new String[]{".*About.*", ".*Available.*", ".*Author.*", ".*Blog.*", ".*Business.*",
            ".*Career.*", ".*category.*", ".*City.*", ".*Company.*", ".*Contain.*", ".*Contact.*", ".*Download.*",
            ".*Email.*"};
        while (in.hasNextLine())
        {
            input = in.nextLine();
            //for(String s:reg)
                patternFind(input, ".*email.*");
        }
        in.close();
    }

    // Prints every match of the regex reg found in input
    public static void patternFind(String input, String reg)
    {
        Pattern p = Pattern.compile(reg);
        Matcher m = p.matcher(input);
        while (m.find())
            System.out.println(m.group());
    }
}

Upvotes: 0

Views: 1681

Answers (3)

nhahtdh

Reputation: 56809

I'm going to give a somewhat general solution. You should be able to adapt the idea to your code.

Suppose you have a list of bare keywords in a file and you read it into a String[], or you hard-code the list of keywords in a String[], for example:

String[] keywords = {"about", "available", "email"};

For each keyword, use Pattern.quote() to make sure it is treated as a literal string. Then join the keywords with the bar character | as separator (OR), and surround everything with parentheses (). You can skip the Pattern.quote() step if you are sure that the keywords contain no regex metacharacters, or you can look at the keywords yourself and write the regex by hand without the quoting \Q and \E. With quoting, the end result will look like this:

(\Qabout\E|\Qavailable\E|\Qemail\E)

Add .* at both ends so that the pattern matches the rest of the URL, plus (?i) at the beginning to enable case-insensitive matching.

(?i).*(\Qabout\E|\Qavailable\E|\Qemail\E).*

Then you can compile the Pattern once and call matcher(inputString).matches() on each line of input to check whether the URL contains any of the keywords.

If a keyword is too common in URLs, such as "com", "net", or "www", and you want a finer-grained search, more tweaking is needed.
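The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the class name KeywordPattern and the helper build() are hypothetical names chosen for this example, and the keyword list is made up so that the URL from the question matches.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

class KeywordPattern {

    // Builds (?i).*(\Qk1\E|\Qk2\E|...).* from a list of bare keywords.
    // build() is a hypothetical helper name, not part of any library.
    public static Pattern build(String... keywords) {
        StringJoiner alternatives = new StringJoiner("|", "(?i).*(", ").*");
        for (String keyword : keywords) {
            // Pattern.quote() wraps the keyword in \Q...\E so it is matched literally
            alternatives.add(Pattern.quote(keyword));
        }
        return Pattern.compile(alternatives.toString());
    }

    public static void main(String[] args) {
        // Compile once, then reuse the Pattern for every line of input
        Pattern p = build("about", "contact", "email");
        String url = "http://www.365media.com/offices-and-contact.html";
        System.out.println(p.matcher(url).matches()); // prints: true
    }
}
```

Because matches() has to match the whole line, the leading and trailing .* are required here; with find() they could be dropped.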

Upvotes: 0

lukastymo

Reputation: 26799

Why can't you do this:

For each line (URL) in the file, check whether any of your patterns matches the URL.

The code is pretty obvious.
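For completeness, one possible shape of that loop is sketched below. The class name UrlFilter and the method filter() are hypothetical, the example.com URLs are made up, and the URLs are held in an in-memory list here rather than read from the file, purely to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilter {

    // Returns the URLs in which at least one of the patterns is found.
    public static List<String> filter(List<String> urls, List<Pattern> patterns) {
        List<String> matched = new ArrayList<>();
        for (String url : urls) {
            for (Pattern p : patterns) {
                if (p.matcher(url).find()) {
                    matched.add(url);
                    break; // one hit is enough; do not add the same URL twice
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Patterns are compiled once, outside the loop over the URLs
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("contact", Pattern.CASE_INSENSITIVE),
                Pattern.compile("email", Pattern.CASE_INSENSITIVE));
        List<String> urls = Arrays.asList(
                "http://www.365media.com/offices-and-contact.html",
                "http://example.com/index.html");
        for (String url : filter(urls, patterns)) {
            System.out.println(url);
        }
    }
}
```

Using find() instead of matches() means the patterns do not need a leading and trailing .* to hit a keyword in the middle of a URL.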

Upvotes: 1

Roben

Reputation: 848

If you only want to check whether any keyword occurs in the current line, you can simply use

for (String s: reg) {
  if (input.contains(s)) {
    // do something
  }
}

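instead of patternFind(input,".*email.*");. Note that String.contains() does a literal, case-sensitive comparison, so for this to work reg has to hold the bare keywords (for example "About") rather than patterns like ".*About.*".

Anyway, a regular expression equivalent to match any of the words would be: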

.*(About|Available|Author|And|So|On...).*

I'm not sure which one is faster. String.contains() is simpler, but a Pattern is precompiled, which could perform better when it is applied many times, as is the case here.
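A case-insensitive variant of the contains() approach might look like the sketch below. The class name KeywordContains and the method containsAny() are hypothetical, and the keywords are assumed to be bare, lowercase words without any regex syntax.

```java
import java.util.Locale;

class KeywordContains {

    // True if the URL contains any of the keywords, ignoring case.
    // Assumption: the keywords are already lowercase, bare words.
    public static boolean containsAny(String url, String... lowercaseKeywords) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String keyword : lowercaseKeywords) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://www.365media.com/offices-and-contact.html";
        System.out.println(containsAny(url, "about", "contact", "email")); // prints: true
    }
}
```

Lowercasing the URL once per line keeps the comparison case-insensitive without involving the regex engine at all.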

Upvotes: 1
