Reputation: 1
I'm writing a small parser to extract data about diseases from CDC web pages. I'm using jsoup, and everything seems to work except for the following.
I have four example URLs, which I've parsed to obtain the link to the "section" that contains the data I want (see code).
If you look at the source of each page, you can check that these links exist.
After obtaining this link (an internal link), I try to retrieve the Element object that has this value, but this works on only two of the four pages, and I don't know why.
Here is my code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MainJSoupTest {

    public MainJSoupTest() {
        try {
            test("http://www.cdc.gov/HAI/organisms/bCepacia.html", "#a3");
            test("http://www.cdc.gov/meningitis/bacterial.html", "#symptoms");
            test("http://www.cdc.gov/nczved/divisions/dfbmd/diseases/botulism/", "#symptoms");
            test("http://www.cdc.gov/getsmart/antibiotic-use/URI/bronchitis.html", "c");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void test(String url, String element) throws Exception {
        Document doc = Jsoup.connect(url).get();
        Elements els = doc.select(element);
        System.out.println(" ---- Test -----");
        System.out.println("URL: " + url);
        System.out.println("Element: " + element);
        System.out.println("Size: " + els.size());
    }

    public static void main(String[] args) {
        new MainJSoupTest();
    }
}
And the output:
---- Test -----
URL: http://www.cdc.gov/HAI/organisms/bCepacia.html
Element: #a3
Size: 1
---- Test -----
URL: http://www.cdc.gov/meningitis/bacterial.html
Element: #symptoms
Size: 0
---- Test -----
URL: http://www.cdc.gov/nczved/divisions/dfbmd/diseases/botulism/
Element: #symptoms
Size: 1
---- Test -----
URL: http://www.cdc.gov/getsmart/antibiotic-use/URI/bronchitis.html
Element: c
Size: 0
As you can see, the size for two of the pages is 1 (as expected, there is an element that represents the internal link). However, the other two return 0.
Any thoughts?
Upvotes: 0
Views: 94
Reputation: 6724
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
You can get elements just by passing a selector made of ids and element names, as in the example above, rather than a generic string like "elements".
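Two things may be worth checking here. First, a selector of the form "#symptoms" matches only an element whose id attribute is "symptoms", while a bare "c" (no leading #) is a tag selector that looks for <c> elements. Second, and this is an assumption because the page source is not shown in the question, the pages that return 0 might mark the target with an older-style anchor such as <a name="symptoms"> instead of an id, which "#symptoms" would not match. A minimal sketch of a helper that builds a selector covering both cases follows; FragmentSelector and forFragment are hypothetical names, not part of jsoup.

```java
import java.util.Objects;

/** Hypothetical helper (not part of jsoup): turns a URL fragment into a CSS selector. */
final class FragmentSelector {

    private FragmentSelector() {}

    /**
     * Accepts "symptoms" or "#symptoms" and returns a selector that matches
     * either an element with that id or an <a name="..."> anchor.
     * Assumption: the fragment contains only letters, digits, '-' and '_',
     * so the value can be used unquoted inside the attribute selector.
     */
    static String forFragment(String fragment) {
        Objects.requireNonNull(fragment, "fragment");
        // Drop the leading '#' if the caller passed the link's href as-is.
        String name = fragment.startsWith("#") ? fragment.substring(1) : fragment;
        if (!name.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Unsupported fragment: " + fragment);
        }
        // The comma means "either": id match OR named anchor match.
        return "[id=" + name + "], a[name=" + name + "]";
    }

    public static void main(String[] args) {
        System.out.println(forFragment("#symptoms")); // [id=symptoms], a[name=symptoms]
        System.out.println(forFragment("c"));         // [id=c], a[name=c]
    }
}
```

In the question's test method, the call would then become doc.select(FragmentSelector.forFragment(element)). If the size is still 0 after that, printing doc.html() and searching it for the fragment name should show how the page actually marks the section.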
Upvotes: 0