Jeremy Hunts

Reputation: 363

Split raw HTML string into lines again in Jsoup

I extracted the raw HTML from a website, but it all ended up in one string. I want to split it into lines, just like "View page source" in Google Chrome.

This is my code.

String url = "https://stratechery.com/2016/how-google-cloud-platform-is-challenging-aws/";
// crawl(url, " more Complete Footwear.txt", 9000);

System.out.println(br2nl(url));
Document doc = Jsoup.connect(url)
        .data("query", "Java")
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)
        .post();
String rawhtml = doc.toString();
String[] lines = rawhtml.split("\"" + " ");

I tried splitting the "rawhtml" string on a quote followed by a space, but that sequence appears all over every line, so it split the text everywhere.

Upvotes: 0

Views: 605

Answers (1)

Tim

Reputation: 4274

I think you might be missing the point of Jsoup.

You don't have to parse the HTML yourself line by line; Jsoup has methods for that. The HTML is already parsed into the Jsoup Document you created, so you can access its elements one by one or in groups. The possibilities are endless; take a look at the official docs: https://jsoup.org/cookbook/
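As a rough illustration of working with the parsed Document instead of the raw string, here is a minimal sketch that collects every link target with a CSS selector. It assumes the jsoup library is on the classpath; the class name, the method name linkTargets, and the inline HTML are made up for the example so that it runs without a network request.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupSelectSketch {

    // Hypothetical helper: returns the href attribute of every <a> element that has one.
    public static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up inline HTML, used here in place of a fetched page.
        String html = "<html><head><title>Demo</title></head><body>"
                + "<a href=\"https://example.com/a\">First</a>"
                + "<a href=\"https://example.com/b\">Second</a>"
                + "</body></html>";

        System.out.println(Jsoup.parse(html).title());
        for (String href : linkTargets(html)) {
            System.out.println(href);
        }
    }
}
```

With a fetched page, the same select call would work on the Document returned by Jsoup.connect(url).get().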

Nonetheless, to answer your question: to split the whole HTML string on newlines, you could do this:

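import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

  public static void main(String[] args) throws IOException {

    String url = "https://stratechery.com/2016/how-google-cloud-platform-is-challenging-aws/";

    // Fetch and parse the page with a GET request.
    Document doc = Jsoup.connect(url)
        .userAgent("Mozilla")
        .get();

    // doc.toString() returns the HTML as serialized by Jsoup; print it one line at a time.
    for (String s : doc.toString().split("\\n")) {
      System.out.println(s);
    }
  }
}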
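If the string might contain Windows-style line endings, a small variation is to split on the \R regex escape (Java 8 and later), which matches any line terminator and treats \r\n as a single break. The sketch below uses only the standard library; the class name, the method name splitLines, and the sample HTML are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class SplitLines {

    // Hypothetical helper: splits text on any line terminator (\n, \r\n, \r, ...).
    public static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        // Made-up sample with mixed line endings.
        String html = "<html>\r\n<body>\n<p>Hello</p>\n</body>\n</html>";

        List<String> lines = splitLines(html);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println((i + 1) + ": " + lines.get(i));
        }
    }
}
```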

Upvotes: 1
