Reputation: 147
I am trying to read content from a web page with the code below. I want to extract the result links, the author names below each link, and the PDF or HTML links on the right-hand side, and save them to my database or to a document file using Java.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HTMLParserExample1 {
    public static void main(String[] args) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect("http://scholar.google.com/scholar? l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp=").userAgent("Chrome").get();
            Element content = doc.getElementById("content");
            Elements links = content.getElementsByTag("a");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
                System.out.println("\nLinHREF: "+linkHref);
                System.out.println("linktext: "+linkText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Above is my code. Earlier it gave me a 403 error, but after I set userAgent("Mozilla") it throws a NullPointerException instead:
Exception in thread "main" java.lang.NullPointerException
at HTMLParserExample1.main(HTMLParserExample1.java:20)
Java Result: 1
BUILD SUCCESSFUL (total time: 1 second)
Please help.
Upvotes: 0
Views: 939
Reputation: 32507
Well, it works for me if I remove the space from the URL:
http://scholar.google.com/scholar?l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp=
works just fine. I strongly suggest using a Google API for web searches instead of parsing Google's pages directly.
Here is some info about the GData API.
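To make the URL fix concrete, here is a minimal sketch that uses only the standard library. It strips whitespace from a pasted URL and checks whether java.net.URI accepts the result. The class name UrlCleaner and both helper methods are made up for this example and are not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlCleaner {
    // Hypothetical helper: removes every whitespace character from the URL string.
    public static String removeWhitespace(String url) {
        return url.replaceAll("\\s+", "");
    }

    // Hypothetical helper: true if java.net.URI accepts the string as syntactically valid.
    public static boolean isValidUri(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            // A raw space in the query part ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // URL as it appears in the question, with a space after the '?'.
        String pasted = "http://scholar.google.com/scholar? l=en&q=visualization&btnG=&as_sdt=1%2C4&as_sdtp=";
        String cleaned = removeWhitespace(pasted);

        System.out.println("pasted valid:  " + isValidUri(pasted));
        System.out.println("cleaned valid: " + isValidUri(cleaned));
        System.out.println(cleaned);
    }
}
```

The cleaned string is what you would then pass to Jsoup.connect(...). Separately, doc.getElementById("content") returns null when the page has no element with that id, so it is worth checking content for null before calling getElementsByTag on it; that may be where the NullPointerException comes from.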
Upvotes: 1