pek

Reputation: 18035

How to fetch HTML in Java

Without the use of any external library, what is the simplest way to fetch a website's HTML content into a String?

Upvotes: 35

Views: 78945

Answers (7)

piero B

Reputation: 47

Well, it depends on what you expect to do with the fetched HTML string. If your goal is to parse the HTML or extract data from it, why refrain from using an external library?

Jsoup does the whole job very well without having to write a single regex yourself.

For example, to get the page title ( <head><title>this one</title>... ) you only need these few lines of code:

 String url = "https://www.example.com";
 Document document = Jsoup.connect(url).get();
 String title = document.title();

To use Jsoup you just have to add the dependency to your pom.xml file (make sure to pick the right version for the JDK you're running):

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.18.1</version>
    </dependency>

With the Jsoup document created on the second line of the above example, you can access any DOM element with CSS-like selectors. For instance, this will print the URLs of every image in the page:

  document.select("img")
    .forEach(element -> System.out.println(element.attr("src")));

You can access the raw HTML string if you really need to: String rawHtml = document.html();

I am also often tempted not to use any external library, but I am very glad I used this one. Straightforward, simple to use and very comprehensive.

Upvotes: 1

Dheeraj Mukharjee

Reputation: 1

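    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;

    try {
        URL u = new URL("https://www.Samsung.com/in/");
        URLConnection urlconnect = u.openConnection();
        // Decode bytes as UTF-8 (an assumption about the page) instead of casting each byte to a char
        try (Reader reader = new InputStreamReader(urlconnect.getInputStream(), StandardCharsets.UTF_8)) {
            int i;
            // Print the response one character at a time until end of stream
            while ((i = reader.read()) != -1) {
                System.out.print((char) i);
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }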

Upvotes: 0

pek

Reputation: 18035

I'm currently using this:

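import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

String content = null;
try {
    URLConnection connection = new URL("http://www.google.com").openConnection();
    // Decode explicitly (UTF-8 is assumed here) rather than relying on the platform default charset
    try (Scanner scanner = new Scanner(connection.getInputStream(), "UTF-8")) {
        // "\\Z" matches at the end of input, so next() returns the page as a single token
        scanner.useDelimiter("\\Z");
        content = scanner.next();
    }
} catch (Exception ex) {
    ex.printStackTrace();
}
System.out.println(content);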

But not sure if there's a better way.
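One possible tidying of the snippet above, offered as a sketch rather than a definitive answer. The class name `PageReader` and the helper `readAll` are hypothetical, and UTF-8 is an assumption (the real page may declare another encoding). It swaps in the `\\A` delimiter, which matches only at the start of input, and returns an empty string for an empty response instead of letting `next()` throw.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class PageReader {
    // Hypothetical helper: reads the entire stream into one String.
    static String readAll(InputStream in, Charset charset) {
        // try-with-resources closes the scanner and the underlying stream
        try (Scanner scanner = new Scanner(in, charset.name())) {
            // "\\A" matches only at the beginning of input, so the whole stream is one token
            scanner.useDelimiter("\\A");
            // An empty stream has no token; return "" instead of throwing
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageReader <url>");
            return;
        }
        // Assumption: the page is encoded as UTF-8
        try (InputStream in = new URL(args[0]).openStream()) {
            System.out.println(readAll(in, StandardCharsets.UTF_8));
        }
    }
}
```

Keeping the stream-to-String step separate from the URL handling means the same helper can be exercised against any InputStream, for example a ByteArrayInputStream.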

Upvotes: 48

dinesh kandpal

Reputation: 775

It's not a library but a tool named curl, which is generally installed on most servers. On Ubuntu you can easily install it with

sudo apt install curl

Then fetch any HTML page and store it in a local file, for example

curl https://www.facebook.com/ > fb.html

You will get the home page HTML. You can open it in your browser as well.

Upvotes: -4

Scott Bennett-McLeish

Reputation: 9287

Whilst not vanilla-Java, I'll offer up a simpler solution. Use Groovy ;-)

String siteContent = new URL("http://www.google.com").text

Upvotes: 2

Scott Bennett-McLeish

Reputation: 9287

This has worked well for me:

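import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL(theURL);
StringBuilder buffer = new StringBuilder();
// Decode as UTF-8 (an assumption about the page) and close the stream when done
try (Reader reader = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
    int ptr;
    while ((ptr = reader.read()) != -1) {
        buffer.append((char) ptr);
    }
}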

Not sure as to whether the other solution(s) provided are any more efficient or not.
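On the efficiency question, one common variation is to read in chunks rather than one character per call. The following is only a sketch under stated assumptions: the class name `ChunkedFetch`, the helper `readChunked`, the 8192-character buffer size, and the UTF-8 charset are all illustrative choices, not something measured against the loop above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ChunkedFetch {
    // Hypothetical helper: reads the stream in blocks of up to 8192 characters.
    static String readChunked(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192]; // buffer size is an arbitrary choice
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            // read(char[]) returns the number of characters read, or -1 at end of stream
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ChunkedFetch <url>");
            return;
        }
        // Assumption: the page is encoded as UTF-8
        System.out.println(readChunked(new URL(args[0]).openStream(), StandardCharsets.UTF_8));
    }
}
```

Whether this is noticeably faster depends on the JVM and the network; for a single page the download time usually dominates.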

Upvotes: 23

Justin Bennett

Reputation: 9128

I just left this post in your other thread, though what you have above might work as well. I don't think either would be any easier than the other. With the Apache Commons HttpClient jar on the classpath, the client class can be accessed by just using import org.apache.commons.httpclient.HttpClient at the top of your code.

Edit: Forgot the link ;)

Upvotes: 2
