user3019431

Reputation: 11

A certain string part needed

I'm writing an app that should download a website's entire HTML text into a string. Then I want to use System.out.println to print one specific fragment of that string. My code:

import java.net.*;
import java.io.*;

public class URLConnectionReader {
    public static void main(String[] args) throws Exception {

        URL oracle = new URL("www.example-blahblahblah.com");
        BufferedReader in = new BufferedReader(
        new InputStreamReader(oracle.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)

       System.out.println(inputLine.substring(inputLine.indexOf("<section class=\"horoscope-content\"><p>")+1, inputLine.lastIndexOf("</p")));

        in.close();
    }
}

It's supposed to show me text typed below:

<section class="horoscope-content">
    <p>Text text text text</p>

Instead, I get this:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(Unknown Source)
    at URLConnectionReader.main(URLConnectionReader.java:14)

What should I do?

Upvotes: 0

Views: 107

Answers (2)

Derek M

Reputation: 53

Your code reassigns inputLine every time the while condition is evaluated, so each substring call sees only a single line. Depending on the HTML, you might want to read in the entire file before looking for the section of markup.
Unless you are positive that the HTML contains those sections of text, you will still get exceptions when they don't exist.
You also only increased the start index by 1. If you don't want the beginning text in the output, you have to increase it by the length of the beginning section.
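To see where the -1 comes from, here is a minimal sketch that applies the question's substring expression to single lines. The class name SubstringFailureDemo and the sample lines are made up for illustration.

```java
final class SubstringFailureDemo {
    static final String MARKER = "<section class=\"horoscope-content\"><p>";

    // Applies the question's expression to one line; returns null if it throws.
    static String questionExpression(String line) {
        try {
            return line.substring(line.indexOf(MARKER) + 1, line.lastIndexOf("</p"));
        } catch (StringIndexOutOfBoundsException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A line without the marker: indexOf gives -1 (+1 = 0) and lastIndexOf
        // gives -1, so substring(0, -1) throws.
        System.out.println(questionExpression("<html>"));
        // A line containing both markers: no exception, but "+ 1" only skips
        // the leading '<' instead of the whole start tag.
        System.out.println(questionExpression(MARKER + "Text</p>"));
    }
}
```

Since almost every line of a real page lacks the marker, the loop in the question fails on the very first line it reads.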

You can try something like this:

StringBuilder html = new StringBuilder(); // holds all of the HTML we read
String inputLine;
while ((inputLine = in.readLine()) != null) { // read line by line
  html.append(inputLine); // add line to html (readLine() strips the line terminator)
}
inputLine = html.toString(); // get the whole page as a single string
String startText = "<section class=\"horoscope-content\"><p>"; // starting tag; must match the page's markup exactly, including any whitespace between the tags
int start = inputLine.indexOf(startText);
int end = inputLine.lastIndexOf("</p"); // might want to use something like inputLine.indexOf("</p>", start); if there are multiple sections on the page
if (start >= 0 && end >= start + startText.length()) { // make sure we found a section and the end tag comes after it
  System.out.println(inputLine.substring(start + startText.length(), end)); // print everything between the start and end tags (excluding the text in the start tag)
} else {
  System.out.println("section not found"); // do something else since we didn't find the tags
}
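If the page has whitespace or a line break between the section tag and the <p> tag, as in the question's expected output, a single combined start string will not match. A possible variation, sketched here with hypothetical names (SectionExtractor, extractParagraph), searches for the tags one after another:

```java
import java.util.Optional;

final class SectionExtractor {
    // Returns the text of the first <p> after the horoscope section tag,
    // or an empty Optional if any of the three markers is missing.
    static Optional<String> extractParagraph(String html) {
        String sectionTag = "<section class=\"horoscope-content\">";
        int section = html.indexOf(sectionTag);
        if (section < 0) {
            return Optional.empty();
        }
        // Search for <p> only after the section tag, so whitespace in between is fine.
        int pOpen = html.indexOf("<p>", section + sectionTag.length());
        if (pOpen < 0) {
            return Optional.empty();
        }
        int textStart = pOpen + "<p>".length();
        // Use the first </p> after the paragraph start, not the last one on the page.
        int pClose = html.indexOf("</p>", textStart);
        if (pClose < 0) {
            return Optional.empty();
        }
        return Optional.of(html.substring(textStart, pClose));
    }

    public static void main(String[] args) {
        String html = "<html>\n<section class=\"horoscope-content\">\n"
                + "    <p>Text text text text</p>\n</section>\n<p>other</p></html>";
        System.out.println(extractParagraph(html).orElse("section not found"));
    }
}
```

This still depends on the exact spelling of the section tag, so it only tolerates differences between the two tags, not inside them.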

Upvotes: 0

isnot2bad

Reputation: 24454

Instead of indexOf, you should use a more tolerant regular expression, which is more robust against minor changes in the input:

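// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("<section\\s+class\\s*=\\s*\"horoscope-content\"\\s*>\\s*<p>(.*?)</p>", Pattern.DOTALL);
Matcher matcher = pattern.matcher(line); // line should hold the complete HTML, since a match can span several lines
if (matcher.find()) {
    System.out.println(matcher.group()); // the whole match, including the tags
    System.out.println("Text in paragraph: " + matcher.group(1)); // only the text between <p> and </p>
}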

This will be tolerant concerning line breaks and other whitespace characters.
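Because the match can span several lines, the pattern has to be applied to the whole page rather than to each line returned by readLine(). A rough sketch of how that might look, using the same pattern; the names HoroscopeRegex, readAll and firstParagraph are hypothetical:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class HoroscopeRegex {
    private static final Pattern SECTION = Pattern.compile(
            "<section\\s+class\\s*=\\s*\"horoscope-content\"\\s*>\\s*<p>(.*?)</p>",
            Pattern.DOTALL);

    // Reads all remaining lines and joins them with '\n' so line breaks are kept.
    static String readAll(BufferedReader reader) {
        return reader.lines().collect(Collectors.joining("\n"));
    }

    // Returns the text of the first matching paragraph, if there is one.
    static Optional<String> firstParagraph(String html) {
        Matcher matcher = SECTION.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // A StringReader stands in for the URL stream from the question.
        BufferedReader in = new BufferedReader(new StringReader(
                "<section class=\"horoscope-content\">\n    <p>Text text text text</p>"));
        System.out.println(firstParagraph(readAll(in)).orElse("section not found"));
    }
}
```

With the question's code, the BufferedReader wrapped around oracle.openStream() could be passed to readAll in place of the StringReader.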

Upvotes: 1
