user1011689

Reputation: 1

how to grab actual links directed by hrefs

I'm parsing the HTML of a webpage and using a regex to grab every link that appears in an href. Some websites, Wikipedia for instance, write certain hrefs in the HTML as what looks like a paraphrase of the real link. For example:

code says:

href="#cite_note-Types_of_Test_Item_Formats-

but link is actually: http://en.wikipedia.org/wiki/Test_(assessment)#cite_note-Types_of_Test_Item_Formats-15

How can I get the full links using only the webpage source?

EDIT: I'm coding in Java.

Any help is appreciated.

Upvotes: 0

Views: 109

Answers (2)

T.J. Crowder

Reputation: 1075447

They're not paraphrasings, they're fragment identifiers. The # introduces an identifier for a fragment of a page. So what you've quoted is a relative URL for the current page, with a different fragment identifier. There's more in the Wikipedia page about URLs and the RFCs it links to.

Note that fragments don't only show up on their own. They can appear in any URL, relative or absolute. If you're going to handle URLs, you'll have to understand how to resolve relative URLs. For instance, if we assume we're on the page http://example.com/foo/bar.html, then:

  • #frag
    resolves to
    http://example.com/foo/bar.html#frag
  • ../alt.html
    =>
    http://example.com/alt.html
  • /bonzo/nifty#stuff
    =>
    http://example.com/bonzo/nifty#stuff
  • //stackoverflow.com/questions/8110960/8110987#8110987 (note the lack of protocol)
    =>
    http://stackoverflow.com/questions/8110960/8110987#8110987
    (yes, really)

...etc., etc.
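
Since the question mentions Java, here is a minimal sketch of doing this resolution with the standard java.net.URI class rather than by hand. The class name HrefResolver and the resolve helper are made up for illustration; the only library calls are URI.create and URI.resolve. It assumes you know the URL the page was fetched from and that each href is already a syntactically valid URI reference (an href containing spaces or other illegal characters makes URI.create throw IllegalArgumentException, so real-world input may need cleaning first).

```java
import java.net.URI;

class HrefResolver {
    // Resolve an href taken from the page source against the URL
    // the page itself was fetched from.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/foo/bar.html";
        String[] hrefs = {
            "#frag",
            "../alt.html",
            "/bonzo/nifty#stuff",
            "//stackoverflow.com/questions/8110960/8110987#8110987"
        };
        for (String href : hrefs) {
            System.out.println(href + " => " + resolve(page, href));
        }
    }
}
```

Note that ../alt.html climbs out of /foo/, so it comes back as http://example.com/alt.html, and the protocol-less URL picks up the http scheme of the page it was found on.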

Upvotes: 2

Mikael Sundberg

Reputation: 783

On Wikipedia, that just refers to a part of the page you are currently on; the browser will simply scroll down to the anchor for you. On some sites, though, such as Twitter, it works differently. My account, for example, http://twitter.com/#!/msundb (and http://twitter.com/msundb, which forwards to it), is actually just the root of twitter.com. Everything after the #! is there to tell the JavaScript on the page what content it should load. It even has the link rel canonical set to "/", telling Google that it is the start page (although it isn't).

So how you should interpret the links depends on what you are doing with them.
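
If what you care about is which document a link actually fetches, one option is to split off the fragment and keep the two parts separately. This is only a sketch: FragmentSplitter and its two helpers are hypothetical names, and it assumes the URL is already absolute and syntactically valid.

```java
import java.net.URI;

class FragmentSplitter {
    // The part of the URL that identifies the document itself;
    // everything from '#' onward is dropped.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // The fragment on its own, or null when the URL has none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "http://twitter.com/#!/msundb";
        System.out.println(withoutFragment(url)); // http://twitter.com/
        System.out.println(fragmentOf(url));      // !/msundb
    }
}
```

Two links that differ only in their fragment point at the same document, so a crawler would usually deduplicate on the first value, while a tool that needs the exact anchor or the #! route would keep the second as well.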

Upvotes: 0
