Reputation: 2061
I'm writing a simple web crawler. The problem is with the link extraction.
I use cpp-netlib with Boost. Here are a few lines of my CLink class.
CLink::CLink(const CLink& father, const std::string& relUrl )
{
    uri = relUrl;
    boost::network::uri::uri instance(relUrl);
    boost::network::uri::uri instanceFather(father.uri);
    if ( (valid = boost::network::uri::is_valid(instance)) == 1)
    {
        scheme = boost::network::uri::scheme(instance);
        user_info = boost::network::uri::user_info(instance);
        host = boost::network::uri::host(instance);
        port = boost::network::uri::port(instance);
        path = boost::network::uri::path(instance);
        query = boost::network::uri::query(instance);
        fragment = boost::network::uri::fragment(instance);
        uri = scheme;
        uri += "://";
        uri += host;
        uri += path;
    }
    else
    {
        if ( (valid = boost::network::uri::is_valid(instanceFather)) == 1)
        {
            scheme = boost::network::uri::scheme(instanceFather);
            user_info = boost::network::uri::user_info(instanceFather);
            host = boost::network::uri::host(instanceFather);
            port = boost::network::uri::port(instanceFather);
            path = boost::network::uri::path(instance);
            query = boost::network::uri::query(instance);
            fragment = boost::network::uri::fragment(instance);
            uri = scheme;
            uri += "://";
            uri += host;
            uri += path;
        }
    }
}
CLink::CLink( const std::string& _url )
{
    uri = _url;
    boost::network::uri::uri instance(_url);
    if ( (valid = boost::network::uri::is_valid(instance) ) == 1)
    {
        scheme = boost::network::uri::scheme(instance);
        user_info = boost::network::uri::user_info(instance);
        host = boost::network::uri::host(instance);
        port = boost::network::uri::port(instance);
        path = boost::network::uri::path(instance);
        query = boost::network::uri::query(instance);
        fragment = boost::network::uri::fragment(instance);
        uri = scheme;
        uri += "://";
        uri += host;
        uri += path;
    }
    else
        std::cout << "err " << std::endl;
}
I extract the links from the web page with the htmlcxx library. I take each HTML::Node and normalize its link with Boost.Filesystem.
if ( url.find("http://") == std::string::npos)
{
    std::string path = link.get_path() + url;
    url = link.get_host() + path;
    boost::filesystem::path result;
    boost::filesystem::path p(url);
    for(boost::filesystem::path::iterator it=p.begin(); it!=p.end(); ++it)
    {
        if(*it == "..")
        {
            if(boost::filesystem::is_symlink(result) )
                result /= *it;
            else if(result.filename() == "..")
                result /= *it;
            else
                result = result.parent_path();
        }
        else if(*it == ".")
        {
            // Ignore
        }
        else
        {
            // Just cat other path entries
            result /= *it;
        }
    }
    url = "http://" + result.string();
}
return ret;
Now the problem:
I fetch http://www.wikipedia.de/
and get URLs like
properties http://wikimedia.de/wiki/Vereinszeitung ... ...
On the page http://wikimedia.de/wiki/Vereinszeitung
there is a link like /wiki/vereinsatzung,
so I often end up with links like
http://wikimedia.de/wiki/Vereinszeitung/wiki/Freies_Wissen
Does anyone have an idea?
Upvotes: 0
Views: 347
Reputation: 206861
You need a special case for links with an absolute path (those that start with /).
If the href starts with /, then the resulting link should be (using the terms from the URI template, which come from the RFC):
[scheme]://[authority][what you got in href]
What you are currently constructing is:
[scheme]://[authority][path][what you got in href]
So you're duplicating the path information.
So if url (the href you extracted) starts with /, you should simply change:
std::string path = link.get_path() + url;
url = link.get_host() + path; // this is incorrect btw, missing the [port]
to
url = link.get_host() + ":" + link.get_port() + url;
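As a rough sketch of that special case, using only std::string: the helper name resolve_href and its four parameters are made up for illustration and are not part of cpp-netlib or your CLink class. It assumes authority already contains the port when one is needed, and it does not handle protocol-relative hrefs (starting with //), query-only hrefs, or . and .. segments.

```cpp
#include <string>

// Hypothetical helper: join an href found on a page with the parts of that
// page's URL. `authority` is the host, or host:port for a non-default port.
std::string resolve_href(const std::string& scheme,
                         const std::string& authority,
                         const std::string& base_path,
                         const std::string& href)
{
    const std::string prefix = scheme + "://" + authority;

    if (!href.empty() && href[0] == '/') {
        // The href carries its own absolute path, so the page's path is dropped.
        return prefix + href;
    }

    // Relative href: keep the page's path up to and including its last '/'.
    const std::string::size_type slash = base_path.rfind('/');
    const std::string dir = (slash == std::string::npos)
                                ? std::string("/")
                                : base_path.substr(0, slash + 1);
    return prefix + dir + href;
}
```

With the example from the question, resolve_href("http", "wikimedia.de", "/wiki/Vereinszeitung", "/wiki/Freies_Wissen") gives http://wikimedia.de/wiki/Freies_Wissen rather than the duplicated path.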
It would probably be cleaner to do the path normalization on the path only, not on the URL (i.e. add host:port after normalizing the path).
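One way this could look, as a minimal sketch that works on the path string alone and never touches the filesystem (so no is_symlink check is involved): normalize_path is a hypothetical name, and the trailing-slash behaviour follows my reading of the dot-segment rules in RFC 3986, so treat it as an assumption rather than a reference implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collapse "." and ".." segments in a URL path.
// The result always starts with '/'.
std::string normalize_path(const std::string& path)
{
    std::vector<std::string> segments;
    std::string last;                       // last raw segment seen
    std::string::size_type pos = 0;

    while (pos <= path.size()) {
        std::string::size_type next = path.find('/', pos);
        if (next == std::string::npos) next = path.size();
        last = path.substr(pos, next - pos);

        if (last == "..") {
            // Go up one level, but never above the root.
            if (!segments.empty()) segments.pop_back();
        } else if (!last.empty() && last != ".") {
            segments.push_back(last);
        }
        pos = next + 1;
    }

    std::string result;
    for (const std::string& s : segments) {
        result += '/';
        result += s;
    }

    // Keep a trailing slash when the input ended in "/", "/." or "/..".
    const bool is_dir = last.empty() || last == "." || last == "..";
    if (result.empty()) return "/";
    if (is_dir) result += '/';
    return result;
}
```

The host and port would then be prepended afterwards, for example authority + normalize_path(path), so that the host name is never treated as a path segment.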
[And I think your code will fail if it encounters an https link, since it only looks for http:// and always prepends http:// to the result.]
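A possible check that covers both schemes, again only as a sketch: has_http_scheme is a hypothetical helper, and it assumes the scheme is written in lowercase. Unlike url.find("http://"), it only looks at the start of the string, so an http:// that appears later inside a query string does not count.

```cpp
#include <string>

// Hypothetical helper: true if `url` begins with "http://" or "https://".
// Assumes a lowercase scheme; mixed-case schemes are not recognised.
bool has_http_scheme(const std::string& url)
{
    return url.compare(0, 7, "http://") == 0 ||
           url.compare(0, 8, "https://") == 0;
}
```

When rebuilding a relative link, the scheme would then be taken from the parent link instead of being hard-coded.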
Upvotes: 1