oipsl

Reputation: 337

Using multiple regex to scan a file

I have some code that takes a URL, reads the file it points to line by line, and adds every String matching a given regular expression to an ArrayList until it reaches the end of the file. How can I modify my code so that, while reading through the file, I can also check for Strings matching other regular expressions on the same pass, rather than reading the file once for each regex?

    //Pattern currently being checked for
    Pattern name = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");

    //Pattern I want to check for as well, currently not implemented
    Pattern date = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    Matcher m;
    InputStream inputStream = null;
    arrayList = new ArrayList<String>();
    try {
        URL url = new URL(
                "URL to be read");
        inputStream = (InputStream) url.getContent();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        InputStreamReader isr = new InputStreamReader(inputStream);
        BufferedReader buf = new BufferedReader(isr);
        String str = null;
        String s = null;

        try {
            while ((str = buf.readLine()) != null) {

                m = name.matcher(str);
                while(m.find()){
                    s = m.group();
                    arrayList.add(s);
                }

            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Upvotes: 3

Views: 5573

Answers (4)

user unknown

Reputation: 36229

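With two or more patterns, you should keep them in a List and loop over it. You also shouldn't do the reading in the finally block: it runs whether or not the try block succeeded, so if opening the stream fails, inputStream is still null when the reading code runs. Instead, the finally block should be used to close the resources.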

    List <Pattern> patterns = new ArrayList <Pattern> ();
    //Pattern currently being checked for
    patterns.add (Pattern.compile ("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
    //Pattern I want to check for as well, currently not implemented
    patterns.add (Pattern.compile ("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
    BufferedReader buf = null;
    List <String> matches = new ArrayList <String> ();
    try {
        URL url = new URL ("URL to be read");
        InputStream inputStream = (InputStream) url.getContent ();
        InputStreamReader isr = new InputStreamReader (inputStream);
        buf = new BufferedReader (isr);
        String str = null;
        while ((str = buf.readLine ()) != null) 
        {
            for (Pattern p : patterns) 
            {
                Matcher m = p.matcher (str);
                while (m.find ()) 
                    matches.add (m.group ());
            }
        }       
    } 
    catch (Exception e) 
    {
        e.printStackTrace();
    }
    finally  
    {
        if (buf != null) 
            try { buf.close (); } catch (IOException ignored) { /*empty*/}
    }

Not corrected in the code: instead of catching Exception, you should catch the specific exceptions that can occur. The Matcher is only used inside the innermost loop, so declare it there (as the code above does), not in a wider scope. A small scope makes it easier to reason about how a variable is used.
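
The remark about specific exceptions could look roughly like the following. This is only a sketch, assuming Java 7 or later for try-with-resources; the class and method names (MultiPatternScanner, scan, scanUrl) are made up for illustration, and UTF-8 is an assumed encoding that the question does not state.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class MultiPatternScanner {

    // Applies every pattern to every line in a single pass over the input.
    // The reader is closed automatically when the try block exits.
    static List<String> scan(Reader source, List<Pattern> patterns) throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader buf = new BufferedReader(source)) {
            String line;
            while ((line = buf.readLine()) != null) {
                for (Pattern p : patterns) {
                    Matcher m = p.matcher(line); // narrowest possible scope
                    while (m.find()) {
                        matches.add(m.group());
                    }
                }
            }
        }
        return matches;
    }

    // Opens the URL and delegates to scan(). UTF-8 is an assumption.
    static List<String> scanUrl(String address, List<Pattern> patterns) {
        try {
            URL url = new URL(address);
            Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8);
            return scan(reader, patterns);
        } catch (MalformedURLException e) {
            // The address itself is not a valid URL.
            System.err.println("Bad URL: " + e.getMessage());
        } catch (IOException e) {
            // Connecting or reading failed.
            System.err.println("Read failed: " + e.getMessage());
        }
        return new ArrayList<>();
    }
}
```

Because scan takes any Reader, it can be tried on a StringReader without touching the network.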

I'm not sure whether java.util.Scanner can be used to make reading from a URL easier. Have a look at the documentation.
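
For what it's worth, a Scanner can stand in for the BufferedReader in the line loop. A minimal sketch, with a hypothetical class name (ScannerLineSearch) and method name (findAll):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ScannerLineSearch {

    // Reads the input line by line and applies every pattern to each line.
    static List<String> findAll(Scanner in, List<Pattern> patterns) {
        List<String> matches = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            for (Pattern p : patterns) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    matches.add(m.group());
                }
            }
        }
        return matches;
    }
}
```

For a URL, the Scanner could be built with new Scanner(url.openStream(), "UTF-8") inside a try-with-resources block (the encoding is again an assumption). Note that Scanner does not throw IOException while reading; it stores the last one, which can be checked with ioException().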

Upvotes: 6

helpermethod

Reputation: 62165

  1. Create two Matcher objects

    //Pattern currently being checked for
    //(matcher() needs an input; start with an empty one and reset it per line)
    Matcher nameMatcher = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");
    
    //Pattern I want to check for as well, currently not implemented
    Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");
    
    
    // other stuff...
    
  2. Check the read string against each matcher

    while ((str = buf.readLine()) != null) {
    
            nameMatcher.reset(str);
    
            while(nameMatcher.find()){
                s = nameMatcher.group();
                arrayList.add(s);
            }
    
            dateMatcher.reset(str);
    
            while(dateMatcher.find()){
                s = dateMatcher.group();
                arrayList.add(s);
            }
        }
    

Important

Use reset(CharSequence) instead of allocating a new Matcher object for every line.
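
Put together, the two steps might look like the sketch below. The class name (ReusedMatchers) and method name (collect) are hypothetical, and the lines are passed in as an Iterable so the example does not depend on a URL. The Matcher objects are created inside the method because a Matcher is not safe to share between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ReusedMatchers {
    static final Pattern NAME = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");
    static final Pattern DATE = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    static List<String> collect(Iterable<String> lines) {
        // Created once with an empty input, then pointed at each line via reset().
        Matcher nameMatcher = NAME.matcher("");
        Matcher dateMatcher = DATE.matcher("");
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            nameMatcher.reset(line);
            while (nameMatcher.find()) {
                found.add(nameMatcher.group());
            }
            dateMatcher.reset(line);
            while (dateMatcher.find()) {
                found.add(dateMatcher.group());
            }
        }
        return found;
    }
}
```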

Upvotes: 1

leonbloy

Reputation: 75906

Simply obtain a new matcher for the other pattern:

    Matcher m2 = date.matcher(str);
    ... // do whatever you want to do with this pattern match
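
As one possible way to fill in the "do whatever you want" part, here is a sketch that keeps the two kinds of match in separate lists. The class name (NameDateExtractor), its fields, and the choice to store only the link text via group(2) are assumptions for illustration, not something the question asks for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class NameDateExtractor {
    static final Pattern NAME = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");
    static final Pattern DATE = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    final List<String> names = new ArrayList<>();
    final List<String> dates = new ArrayList<>();

    // Call once per line read from the file.
    void accept(String line) {
        Matcher m = NAME.matcher(line);
        while (m.find()) {
            names.add(m.group(2)); // second capture group: the link text
        }
        Matcher m2 = DATE.matcher(line);
        while (m2.find()) {
            dates.add(m2.group());
        }
    }
}
```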

BTW, in general it's not a particularly good idea to parse HTML with regular expressions. (Obligatory link, from the Assistant Don't Parse HTML With Regex Officer in charge.)

Upvotes: 1

RMorrisey

Reputation: 7739

Instead of using a regular expression, use a Java library that understands how to parse HTML properly.

For example, check out the answers for: Java HTML Parsing

Upvotes: 2
