Kennedy Kan

Reputation: 383

JAVA parsing table data

I would like to extract some data from the HTML source of a page. The reference page is http://www.4icu.org/reviews/index2.htm (source view: view-source:http://www.4icu.org/reviews/index2.htm). How can I extract only the university name and the country name with Java? I already know how to extract the university name alone, since it sits between a pair of tags, but how can I make the program faster by scanning only the table cells with class="i", and also extract the country (e.g. United States) from the <...alt="United States" /> attribute?

<tr>
<td><a name="UNIVERSITIES-BY-NAME"></a><h2>A-Z list of world Universities and Colleges</h2>
</tr>

<tr>
<td class="i"><a href="/reviews/9107.htm"> A.T. Still University</a></td>
<td width="50" align="right" nowrap>us <img src="/i/bg.gif" class="fl flag-us" alt="United States" /></td>
</tr>

Thanks in advance.

EDIT: Following what @11thdimension suggested, here is my .java file:

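import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {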
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL ("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();        
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, when I run it, it gives me the following error.

Started
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.4icu.org/reviews/index2.htm

EDIT2: I have written the following program to print the HTTP response headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    // Header name -> values; the status line is stored under the null key.
    Map<String, List<String>> responseMap = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
      System.out.println(entry.getKey() + " = ");

      for (String value : entry.getValue()) {
        System.out.println(value + ", ");
      }
    }
  }
}

It returns the following result:

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8, 

I can get the headers, but how should I combine the code in EDIT and EDIT2 into one complete program? Thanks.

Upvotes: 0

Views: 127

Answers (1)

11thdimension

Reputation: 10633

If this is a one-time task, you should probably use JavaScript for it.

The following code logs the university names to the console. Run it in the browser console while the page is open.

(function () {
    var a = [];
    document.querySelectorAll("td.i a").forEach(function (anchor) {
        a.push(anchor.textContent.trim());
    });

    console.log(a.join("\n"));
})();

The following is a Java example using Jsoup selectors. It parses a locally saved copy of the page.

Maven Dependency

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
</dependencies>

Java Code

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        // Parse a locally saved copy of the page.
        File file = new File("A-Z list of 11930 World Colleges & Universities.html");
        Document doc = Jsoup.parse(file, "UTF-8");

        // Each university name sits in a <td class="i"> cell.
        Elements cells = doc.select("td.i");

        for (Element cell : cells) {
            String university = cell.select("a").text();
            // The country name is the alt text of the flag image in the next cell.
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
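
Regarding the 403 in the question's EDIT: the User-Agent there is set on a java.net.URLConnection that is never used, because Jsoup.connect(...) opens its own connection. A minimal sketch of fetching the live page with the header set on the Jsoup connection itself might look like the following. The class name UniversityScraper and the helper extract are hypothetical names chosen for illustration, the User-Agent string is the one from the question, and there is no guarantee that the site (served through Cloudflare) will accept the request.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name, used only for this sketch.
public class UniversityScraper {

    // User-Agent string taken from the question; the site may still reject it.
    static final String USER_AGENT =
            "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)";

    // Hypothetical helper: builds one output line per <td class="i"> cell.
    static List<String> extract(Document doc) {
        List<String> rows = new ArrayList<>();
        for (Element cell : doc.select("td.i")) {
            String university = cell.select("a").text();
            // The flag image with the country name lives in the next cell, if any.
            Element next = cell.nextElementSibling();
            String country = (next == null) ? "" : next.select("img").attr("alt");
            rows.add(String.format("country : %s, university : %s", country, university));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Set the header on the Jsoup connection, not on a separate URLConnection.
        Document doc = Jsoup.connect("http://www.4icu.org/reviews/index2.htm")
                .userAgent(USER_AGENT)
                .get();

        for (String row : extract(doc)) {
            System.out.println(row);
        }
    }
}
```

Keeping the extraction in its own method means the same logic can be run against a saved file (Jsoup.parse(file, "UTF-8")) or an HTML string if the live request keeps returning 403.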

Upvotes: 1
