Tiberiu

Reputation: 1030

Web Crawler URL results not exact

I've just made my first web crawler. My goal was simply to crawl www.nhl.com and build a database that contains every anchor and button, along with the URL each one points to.

The code seems to be working fine, but I have two questions about the output.

Here are two examples of URL entries in my database:

1. http://www.nhl.com/ice/event.htm?location=/stadiumseries/2014/chi/responsive

2. /ice/m_events.htm

Why do some entries record the entire URL, whereas others only have the path portion of it? [ANSWERED]

Second question: take, for example, this row entry, which is in the form [id, anchor, url]:

9 Players /ice/m_playersearch.htm

When I go to the website in my browser and click on "Players", the URL in my browser becomes:

http://www.nhl.com/ice/playersearch.htm?navid=nav-ply-plyrs#

which ends with a query string that my table entry does not have (?navid=nav-ply-plyrs#)

Having said that, entering the URL from my database still ends up redirecting me to the same page, so it doesn't seem to be a mistake. I'm just wondering why, and how, the site is able to serve the page without that extra part of the URL.

Here is part of my code:

public void crawl(String url){

    try{
        Document doc = Jsoup.connect(url).get();

        Elements pgElem = doc.select("a");
        int id = 0;

        for(Element e : pgElem){
            db.insert(id, e.text(), e.attr("href"));
            id++;
        }

        db.close();   

    }catch(IOException e){
        e.printStackTrace();
    }
}

And my insert method:

 public void insert(int id, String anchor, String url) {

    String string = "INSERT INTO nhl (id,Anchor,URL) " + "VALUES (?, ?, ?)";
    try {
        pst=conn.prepareStatement(string);
        pst.setInt(1, id);
        pst.setString(2, anchor);
        pst.setString(3, url);
        pst.executeUpdate();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}

Upvotes: 0

Views: 100

Answers (1)

tjg184

Reputation: 4686

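Change e.attr("href") to e.attr("abs:href") to get absolute URLs. attr("href") returns the attribute value exactly as it is written in the page's HTML, so links that the page wrote as relative paths are stored as relative paths. The abs: prefix tells jsoup to resolve the value against the document's base URI; e.absUrl("href") does the same thing.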
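To illustrate what resolving a link against the page URL means, here is a minimal sketch that uses only java.net.URI from the standard library. It is an analogy for what an absolute-URL lookup produces, not jsoup's own code; the class name HrefResolveSketch and the helper resolve are made up for this example, and the page URL is assumed to be http://www.nhl.com/.

```java
import java.net.URI;

class HrefResolveSketch {

    // Hypothetical helper: resolve an href value against the URL of the
    // page it was found on. An href that is already absolute is returned
    // unchanged; a relative one is combined with the page URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.nhl.com/"; // assumed crawl start URL

        // Relative href, as stored in the second example row
        System.out.println(resolve(page, "/ice/m_events.htm"));

        // Already-absolute href, as stored in the first example row
        System.out.println(resolve(page,
                "http://www.nhl.com/ice/event.htm?location=/stadiumseries/2014/chi/responsive"));
    }
}
```

With these inputs, the relative entry comes out as http://www.nhl.com/ice/m_events.htm and the absolute entry is left as it was, which is why the two rows in the question look different when the raw href is stored.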

Upvotes: 1
