Reputation: 337
I have some code that takes a URL, reads through the file line by line, searches for strings that match a given regular expression, and adds any matches to an ArrayList until it reaches the end of the file. How can I modify my code so that, on the same pass, I can also check for strings matching other regular expressions, rather than reading the file once for each regex?
//Pattern currently being checked for
Pattern name = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");
//Pattern I want to check for as well, currently not implemented
Pattern date = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");
Matcher m;
InputStream inputStream = null;
arrayList = new ArrayList<String>();
try {
URL url = new URL(
"URL to be read");
inputStream = (InputStream) url.getContent();
} catch (Exception e) {
e.printStackTrace();
} finally {
InputStreamReader isr = new InputStreamReader(inputStream);
BufferedReader buf = new BufferedReader(isr);
String str = null;
String s = null;
try {
while ((str = buf.readLine()) != null) {
m = name.matcher(str);
while(m.find()){
s = m.group();
arrayList.add(s);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
Upvotes: 3
Views: 5573
Reputation: 36229
With two or more patterns, you should keep them in a List and loop over it. And you shouldn't do the matching in the finally block: that block always runs, even if opening the stream failed, in which case inputStream is still null. Instead, the finally block should be used to close the resources.
List <Pattern> patterns = new ArrayList <Pattern> ();
//Pattern currently being checked for
patterns.add (Pattern.compile ("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
//Pattern I want to check for as well, currently not implemented
patterns.add (Pattern.compile ("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
BufferedReader buf = null;
List <String> matches = new ArrayList <String> ();
try {
URL url = new URL ("URL to be read");
InputStream inputStream = (InputStream) url.getContent ();
InputStreamReader isr = new InputStreamReader (inputStream);
buf = new BufferedReader (isr);
String str = null;
while ((str = buf.readLine ()) != null)
{
for (Pattern p : patterns)
{
Matcher m = p.matcher (str);
while (m.find ())
matches.add (m.group ());
}
}
}
catch (Exception e)
{
e.printStackTrace();
}
finally
{
if (buf != null)
try { buf.close (); } catch (IOException ignored) { /*empty*/}
}
Not corrected in the code: instead of catching 'Exception', you should enumerate the specific exceptions. Also, the Matcher is only used inside the innermost loop, so declare it there, not in a bigger scope. A small scope makes it easy to reason about the usage of a variable.
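The two points above (specific exceptions, narrow scope for the Matcher) could look roughly like the sketch below. It swaps the explicit finally block for try-with-resources (Java 7+), which closes the reader automatically. The class and method names (MultiPatternScan, findAll, findAllAt, findAllInText) are made up for illustration, and UTF-8 is an assumption about the page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternScan {

    // Reads each line once and tries every pattern on it.
    static List<String> findAll(BufferedReader reader, List<Pattern> patterns) throws IOException {
        List<String> matches = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (Pattern p : patterns) {
                Matcher m = p.matcher(line); // declared in the innermost scope that uses it
                while (m.find()) {
                    matches.add(m.group());
                }
            }
        }
        return matches;
    }

    // Hypothetical wrapper for a URL; assumes the page is UTF-8 encoded.
    static List<String> findAllAt(String address, List<Pattern> patterns) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                URI.create(address).toURL().openStream(), StandardCharsets.UTF_8))) {
            return findAll(reader, patterns);
        } catch (MalformedURLException e) {
            System.err.println("Bad URL: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("Could not read " + address + ": " + e.getMessage());
        }
        return Collections.emptyList();
    }

    // Same scan over an in-memory string, handy for trying the patterns offline.
    static List<String> findAllInText(String text, List<Pattern> patterns) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return findAll(reader, patterns);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Pattern> patterns = List.of(
                Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"),
                Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        String sample = "<a id=\"dg__ct01_hpl1\">Alice</a> 12/03/2011\nno match on this line";
        System.out.println(findAllInText(sample, patterns));
    }
}
```

Since MalformedURLException is a subclass of IOException, it has to be caught first if you want to report it separately.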
I'm not sure whether java.util.Scanner can be used to make reading from a URL easier. Have a look at the documentation.
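For what it's worth, a Scanner can wrap an InputStream, so something like new Scanner(url.openStream(), "UTF-8") should work in place of the BufferedReader. Here is a minimal sketch of the same loop with a Scanner; it reads from a String so it runs without a network connection, and the class name ScannerScan is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScannerScan {

    // Walks the input line by line and tries every pattern on each line.
    static List<String> findAll(Scanner in, List<Pattern> patterns) {
        List<String> matches = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            for (Pattern p : patterns) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    matches.add(m.group());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = List.of(
                Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"),
                Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        // A String source stands in for the URL stream here.
        try (Scanner in = new Scanner("<a id=\"dg__ct01_hpl1\">Alice</a>\nposted 12/03/2011")) {
            System.out.println(findAll(in, patterns));
        }
    }
}
```

One thing to keep in mind: Scanner does not throw IOException from its read methods; it stops reading and exposes the error through ioException(), so check that if you need to know whether the read failed.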
Upvotes: 6
Reputation: 62165
Create two Matcher objects:
//Pattern currently being checked for
// matcher() needs an input; start with an empty string and reset() it for each line
Matcher nameMatcher = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");
//Pattern I want to check for as well, currently not implemented
Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");
// other stuff...
Check the read string against each matcher
while ((str = buf.readLine()) != null) {
nameMatcher.reset(str);
while(nameMatcher.find()){
s = nameMatcher.group();
arrayList.add(s);
}
dateMatcher.reset(str);
while(dateMatcher.find()){
s = dateMatcher.group();
arrayList.add(s);
}
}
Important
Use reset(CharSequence) instead of allocating a new Matcher object every time.
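Put together as a compilable unit, the reuse idea might look like the sketch below. The class and helper names (ReusedMatchers, scan, collect) are invented for the example, and the lines are passed in as a List so it runs without a URL. Whether reusing the Matcher makes a measurable difference depends on your input size; the sketch only shows the mechanics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusedMatchers {

    // One Matcher per pattern, created once and re-targeted at each line via reset().
    static List<String> scan(List<String> lines) {
        Matcher nameMatcher = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");
        Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            collect(nameMatcher.reset(line), found); // reset(CharSequence) returns the same Matcher
            collect(dateMatcher.reset(line), found);
        }
        return found;
    }

    // Drains all matches of the matcher's current input into the output list.
    private static void collect(Matcher m, List<String> out) {
        while (m.find()) {
            out.add(m.group());
        }
    }

    public static void main(String[] args) {
        System.out.println(scan(List.of(
                "<a id=\"dg__ct01_hpl1\">Alice</a> 12/03/2011",
                "no match on this line")));
    }
}
```

Note that a Matcher is not thread-safe, so this pattern only fits a single-threaded loop like the one in the question.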
Upvotes: 1
Reputation: 75906
Simply obtain a second matcher for the other pattern, inside the same loop:
Matcher m2 = date.matcher(str);
... // do whatever you want to do with this pattern match
BTW, in general it's not a good idea to parse HTML with regular expressions. (Obligatory link, by the Assistant Don't Parse HTML With Regex Officer in charge.)
Upvotes: 1
Reputation: 7739
Instead of using a regular expression, use a Java library that understands how to parse HTML properly.
For example, check out the answers for: Java HTML Parsing
Upvotes: 2