I've just written my first web crawler. My goal was simply to crawl www.nhl.com and build a database containing every anchor and button, along with the URL each one links to.
The code seems to be working fine, but I have two questions about the output.
Here are two examples of URL entries in my database:
1. http://www.nhl.com/ice/event.htm?location=/stadiumseries/2014/chi/responsive
2. /ice/m_events.htm
Why are some entries recorded as the entire URL, whereas others contain only the path portion? [ANSWERED]
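To make the two forms comparable, each stored href can be resolved against the URL of the page it was found on. This is only a minimal sketch using java.net.URI from the standard library; the class name HrefResolver and the method resolve are hypothetical and not part of my crawler, and the base URL is assumed to be the site root.

```java
import java.net.URI;

public class HrefResolver {
    // Resolves an href against the page URL. An absolute href is returned
    // unchanged; a relative one is combined with the page's scheme and host.
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.nhl.com/"; // assumed base URL of the crawled page
        System.out.println(resolve(page, "/ice/m_events.htm"));
        System.out.println(resolve(page,
            "http://www.nhl.com/ice/event.htm?location=/stadiumseries/2014/chi/responsive"));
    }
}
```

With this, both example entries end up as full URLs on the same host, so the database rows can be compared directly.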
Second question: take, for example, this row entry, which is in the form [id, anchor, url]:
9 Players /ice/m_playersearch.htm
When I go to the website in my browser and click on "Players", the URL in my browser becomes:
http://www.nhl.com/ice/playersearch.htm?navid=nav-ply-plyrs#
which has a trailing part (the query string and fragment, ?navid=nav-ply-plyrs#) that my table entry does not.
Having said that, entering the URL from my database still takes me to the same page, so it doesn't seem to be a mistake. I'm just wondering why/how the site is able to determine that the trailing part of the URL is not needed.
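To show exactly which pieces differ, here is a small sketch that splits both URLs into path and query using java.net.URI. The class name UrlParts and its two helper methods are hypothetical names chosen for illustration; the URLs are the ones quoted above.

```java
import java.net.URI;

public class UrlParts {
    // Returns the path component of a URL, e.g. "/ice/playersearch.htm".
    public static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    // Returns the query string without the leading '?', or null if there is none.
    public static String queryOf(String url) {
        return URI.create(url).getQuery();
    }

    public static void main(String[] args) {
        String fromBrowser = "http://www.nhl.com/ice/playersearch.htm?navid=nav-ply-plyrs#";
        String fromTable = "/ice/m_playersearch.htm";

        System.out.println(pathOf(fromBrowser) + " | query: " + queryOf(fromBrowser));
        System.out.println(pathOf(fromTable) + " | query: " + queryOf(fromTable));
    }
}
```

The path identifies the page, while the query string is an optional extra that the server may or may not use; the table entry simply has no query at all.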
Here is part of my code:
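public void crawl(String url) {
    try {
        // Fetch and parse the page
        Document doc = Jsoup.connect(url).get();
        // Select every anchor element
        Elements pgElem = doc.select("a");
        int id = 0;
        for (Element e : pgElem) {
            // Store the link text and the raw href attribute exactly as written in the HTML
            db.insert(id, e.text(), e.attr("href"));
            id++;
        }
        db.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}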
And my insert method:
public void insert(int id, String anchor, String url) {
    String string = "INSERT INTO nhl (id,Anchor,URL) " + "VALUES (?, ?, ?)";
    try {
        pst = conn.prepareStatement(string);
        pst.setInt(1, id);
        pst.setString(2, anchor);
        pst.setString(3, url);
        pst.executeUpdate();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}