Reputation: 1
I'm parsing the HTML of a webpage and grabbing all the links given as hrefs using a regex, but some websites, Wikipedia for instance, write certain hrefs in the HTML as a paraphrase of the real link. For example, the code says:
href="#cite_note-Types_of_Test_Item_Formats-
but the link is actually: http://en.wikipedia.org/wiki/Test_(assessment)#cite_note-Types_of_Test_Item_Formats-15
How can I get to these links using only the webpage source?
EDIT: I'm coding in Java.
Any help is appreciated.
Upvotes: 0
Views: 109
Reputation: 1075447
They're not paraphrasings, they're fragment identifiers. The # introduces an identifier for a fragment of a page. So what you've quoted is a relative URL for the current page, with a different fragment identifier. There's more in the Wikipedia page about URLs and the RFCs it links to.
Note that fragments don't only show up on their own. They can be in any URL, relative or absolute. If you're going to handle URLs, you'll have to understand how to resolve relative URLs. For instance, if we assume we're on the page http://example.com/foo/bar.html, then:
#frag -> http://example.com/foo/bar.html#frag
../alt.html -> http://example.com/alt.html
/bonzo/nifty#stuff -> http://example.com/bonzo/nifty#stuff
//stackoverflow.com/questions/8110960/8110987#8110987 (note the lack of protocol) -> http://stackoverflow.com/questions/8110960/8110987#8110987
...etc., etc.
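Since the question mentions Java, here is a minimal sketch of doing that resolution with the standard java.net.URI class rather than by hand. The class name LinkResolver and the method resolve are hypothetical names chosen for this example, and it assumes the href has already been extracted from the HTML and is a syntactically valid URI reference.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of any library.
final class LinkResolver {

    /**
     * Resolves an href found in a page against the URL of that page.
     * pageUrl is the absolute URL the HTML was downloaded from;
     * href is the raw attribute value (relative or absolute).
     * URI.create throws IllegalArgumentException if either string is not
     * a valid URI (for example, if it contains unescaped spaces), so real
     * crawler code would likely want to catch that and skip the link.
     */
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        // URI.resolve handles fragment-only, path-relative,
        // root-relative and protocol-relative references.
        return base.resolve(href.trim()).toString();
    }
}
```

Under those assumptions, LinkResolver.resolve("http://en.wikipedia.org/wiki/Test_(assessment)", "#cite_note-Types_of_Test_Item_Formats-15") should give back the full link from the question. An href that is already absolute comes back unchanged.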
Upvotes: 2
Reputation: 783
On Wikipedia, that just refers to a part of the page you are currently on; the browser will just scroll down to the anchor for you. On some sites, though, like Twitter, it works differently. My account, for example, http://twitter.com/#!/msundb (and http://twitter.com/msundb, which forwards to it), is actually just the root of twitter.com. Everything after the #! is there to tell the JavaScript on the page what content it should load. It even has the link rel canonical set to "/", telling Google that it is the start page (although it isn't).
So how you should interpret the links depends on what you are doing with them.
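As a rough sketch of what that distinction could look like in Java: the fragment is split off with plain string operations, so a crawler can see which document it would actually fetch and whether the fragment looks like a #! route meant for JavaScript. The class and method names here are made up for the example, and treating a leading "!" as the marker for script-driven content is an assumption based on the Twitter case above, not a general rule.

```java
// Hypothetical helper for illustration; not part of any library.
final class FragmentSplitter {

    /** Returns the URL up to (not including) the first '#': the document that is actually requested. */
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Returns the text after the first '#', or null if the URL has no fragment. */
    static String fragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? null : url.substring(hash + 1);
    }

    /** True if the fragment starts with '!', as in http://twitter.com/#!/msundb. */
    static boolean isHashBang(String url) {
        String frag = fragment(url);
        return frag != null && frag.startsWith("!");
    }
}
```

For an ordinary in-page anchor like the Wikipedia one, withoutFragment gives the page URL and the fragment can simply be ignored when fetching. For a #! URL, the page that is fetched is the same for every fragment, so the fragment itself carries the information about the content.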
Upvotes: 0