Reputation: 11
I'm writing an app that should read a website's entire HTML into a string. Then I want to use System.out.println to show one specific fragment of that string. My code:
import java.net.*;
import java.io.*;
public class URLConnectionReader {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("www.example-blahblahblah.com");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(oracle.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine.substring(inputLine.indexOf("<section class=\"horoscope-content\"><p>")+1, inputLine.lastIndexOf("</p")));
        in.close();
    }
}
It's supposed to show me the text inside the markup below:
<section class="horoscope-content">
<p>Text text text text</p>
Instead, I'm getting this:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(Unknown Source)
    at URLConnectionReader.main(URLConnectionReader.java:14)
What should I do?
Upvotes: 0
Views: 107
Reputation: 53
Your code reassigns inputLine every time it checks the condition in the while statement, so each substring call only ever sees a single line. Depending on the HTML, you might want to read in the entire file before looking for the section of markup.
Unless you are positive that the HTML contains those sections of text, you are still going to get exceptions when they are missing, because indexOf and lastIndexOf return -1 in that case.
You also only increased the start index by 1. If you don't want the beginning tag in the output, you have to increase it by the length of the beginning text.
You can try something like this:
StringBuilder html = new StringBuilder(); //holds all of the html we read
String inputLine;
while ((inputLine = in.readLine()) != null) { //read line by line
    html.append(inputLine); //add line to html
}
inputLine = html.toString(); //get the whole page as a single string
String startText = "<section class=\"horoscope-content\"><p>"; //starting tag
int start = inputLine.indexOf(startText);
int end = inputLine.lastIndexOf("</p"); //might want to use something like inputLine.indexOf("</p>", start); if there are multiple sections on the page
if(start >= 0 && end >= start + startText.length()) { //make sure we found a section and that the end tag comes after the start tag
    System.out.println(inputLine.substring(start+startText.length(), end)); //print everything between the start and end tags (excluding the text in the start tag)
} else {
    System.out.println("section not found"); //do something else since we didn't find the tags
}
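The same idea can be packaged as a small helper. This is only a sketch: the class name SectionExtractor, the method name extractBetween, and the sample HTML are made up for illustration. It looks for the end marker only after the start marker, and returns an empty Optional instead of throwing when a marker is missing.

```java
import java.util.Optional;

// Hypothetical helper, not part of the original code.
final class SectionExtractor {

    // Returns the text between startMarker and the first endMarker that
    // follows it, or Optional.empty() if either marker cannot be found.
    static Optional<String> extractBetween(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty(); // indexOf returns -1 when the marker is absent
        }
        int contentStart = start + startMarker.length(); // skip the marker itself
        int end = text.indexOf(endMarker, contentStart); // search only after the start
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(contentStart, end));
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page.
        String html = "<html><section class=\"horoscope-content\"><p>Text text text text</p>"
                + "</section><p>other</p></html>";
        String startText = "<section class=\"horoscope-content\"><p>";
        System.out.println(extractBetween(html, startText, "</p>").orElse("section not found"));
    }
}
```

Because the end marker is searched from contentStart onward, a later paragraph elsewhere on the page does not widen the match the way lastIndexOf would.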
Upvotes: 0
Reputation: 24454
You should use a more tolerant regular expression instead of indexOf, so that the match survives minor changes in the input:
Pattern pattern = Pattern.compile("<section\\s+class\\s*=\\s*\"horoscope-content\"\\s*>\\s*<p>(.*?)</p>", Pattern.DOTALL);
Matcher matcher = pattern.matcher(line); // line must hold the whole page text, not a single line from readLine()
if (matcher.find()) {
System.out.println(matcher.group());
System.out.println("Text in paragraph: " + matcher.group(1));
}
This tolerates line breaks and other whitespace characters between the tags.
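Here is a self-contained sketch of that pattern applied to a multi-line input. The class name HoroscopeRegex, the method name firstParagraph, and the sample string are assumptions for illustration. The point is that the section tag and the paragraph sit on different lines, so the matcher has to be given the whole text at once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from this answer.
final class HoroscopeRegex {

    // DOTALL lets '.' match line breaks inside the paragraph text.
    static final Pattern SECTION = Pattern.compile(
            "<section\\s+class\\s*=\\s*\"horoscope-content\"\\s*>\\s*<p>(.*?)</p>",
            Pattern.DOTALL);

    // Returns the text of the first paragraph in the section, or null if not found.
    static String firstParagraph(String html) {
        Matcher matcher = SECTION.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up input with the same shape as the markup in the question.
        String html = "<section class=\"horoscope-content\">\n<p>Text text text text</p>";
        System.out.println("Text in paragraph: " + firstParagraph(html));
    }
}
```

Feeding the same input one line at a time would find nothing, since neither line contains both the section tag and the paragraph.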
Upvotes: 1