Reputation: 23767
I recently used Wikipedia's "What links here" function (found under the "Toolbox" element in any entry's left menu), and it got me wondering how this feature actually works.
I'm guessing that searching through all the articles for links wouldn't be very efficient, so are all the links stored in a separate database? If so, is it updated when an article is edited, or at some other time?
Thanks.
Upvotes: 14
Views: 1892
Reputation: 10050
Pseudocode for a simple algorithm that would do it:

procedure updateChanges(editedPage):
    for_each(link on editedPage):
        if(link is not to another wikipedia page): continue
        pageToUpdate = open(link)
        if(pageToUpdate->whatLinksHere.contains(editedPage)): continue
        pageToUpdate->whatLinksHere.insert(editedPage)
Sorry, I just finished my algorithms class, so I have an urge to write pseudocode. In this context, the updateChanges() procedure would be called during the "update the 'what links here' for other pages" phase that Greg Hewgill referred to.
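For what it's worth, here is a minimal runnable sketch of the same idea in Python. The Page class, the in-memory pages dict, and extract_links() are made-up stand-ins, not MediaWiki's actual parser or storage:

import re

class Page:
    def __init__(self, title, wikitext=""):
        self.title = title
        self.wikitext = wikitext
        self.what_links_here = set()   # titles of pages that link to this one

pages = {}   # title -> Page: a toy in-memory "database"

def extract_links(wikitext):
    # Very naive: pull the target title out of every [[...]] link.
    return re.findall(r"\[\[([^\]|#]+)", wikitext)

def update_changes(edited_page):
    for title in extract_links(edited_page.wikitext):
        target = pages.get(title)
        if target is None:      # link to a non-existent ("red") page
            continue
        target.what_links_here.add(edited_page.title)

pages["B"] = Page("B")
pages["A"] = Page("A", "See [[B]] for details.")
update_changes(pages["A"])
print(pages["B"].what_links_here)   # {'A'}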
Upvotes: 2
Reputation: 1269
You could think of this as a more general problem: if you have a link (or pointer, or whatever) from A to B, how can B know that A has a link pointing to it?
The answer is to store the information at the target location. That is, when page A is edited and a link to B is created, store information about the link source at B at the same time (a reverse link). In the case of a web page, the reverse link could be written directly into the "what links here" page. Just a single write into a static page; no need to perform any searches or database queries.
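As a rough illustration (the names here are invented, not Wikipedia's actual code), recording the reverse link at write time makes the later "what links here" lookup a single read:

# Toy backlinks store: target title -> set of source titles.
backlinks = {}

def record_link(source, target):
    # Called when `source` is saved with a link to `target`.
    backlinks.setdefault(target, set()).add(source)

def what_links_here(target):
    return backlinks.get(target, set())

record_link("A", "B")          # page A was edited and now links to B
print(what_links_here("B"))    # {'A'} -- no scan over all pages needed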
Upvotes: 2
Reputation: 12132
It would make sense for the "update event" of an article to trigger a links parser, as that is the only time an article changes. The update event in turn would simply scan for links and query the db for those that are internal to Wikipedia.
I imagine each page has a primary key, and a simple association table relates the page's PK to the PKs of all the other pages that link to it.
There's likely some additional machinery added to aid performance on such a large site, but that would be the basic mechanics.
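A sketch of what such an association table could look like, using SQLite purely for illustration; the table and column names are assumptions, not MediaWiki's real schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        title   TEXT UNIQUE
    );
    CREATE TABLE link (
        from_page INTEGER REFERENCES page(page_id),
        to_page   INTEGER REFERENCES page(page_id),
        PRIMARY KEY (from_page, to_page)
    );
""")

def on_update_event(page_id, linked_titles):
    # Re-derive this page's outgoing links: drop the old rows, insert the new.
    conn.execute("DELETE FROM link WHERE from_page = ?", (page_id,))
    for title in linked_titles:
        row = conn.execute("SELECT page_id FROM page WHERE title = ?",
                           (title,)).fetchone()
        if row:   # only links internal to the wiki end up in the table
            conn.execute("INSERT OR IGNORE INTO link VALUES (?, ?)",
                         (page_id, row[0]))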
Upvotes: 1
Reputation: 124768
The way I would implement it is to extract all the links after an edit, then store them in a separate table keyed by the current URL. Then I could just query that table with the URL the user is currently on and get all the pages that have been marked as linking to it.
It probably wouldn't be as straightforward as that, but that's the general, simplified idea. Instead of URLs it would probably be wiser to store page IDs, and so on.
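Reusing the illustrative schema sketched above (page IDs rather than URLs), the "What links here" lookup then becomes a single indexed query; again, the names are assumptions:

def what_links_here(conn, page_id):
    # One join instead of scanning every article's text for links.
    rows = conn.execute("""
        SELECT p.title
        FROM link AS l JOIN page AS p ON p.page_id = l.from_page
        WHERE l.to_page = ?
    """, (page_id,)).fetchall()
    return [title for (title,) in rows]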
Upvotes: 1
Reputation: 992837
Whenever a page on Wikipedia is edited, it is placed into a background queue for further processing. Some of the things that happen there are:

- updating the "what links here" data for other pages

This sort of information doesn't need to be updated right away when you hit "Submit", so the background processing queue takes care of it. Sometimes this queue can grow quite large, but usually it's kept under control.
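A heavily simplified sketch of that deferral, using a plain in-process queue; the real job queue is database-backed and does much more:

from queue import Queue

job_queue = Queue()   # stand-in for the real, persistent job queue
links_table = {}      # toy "what links here" store: target -> set of sources

def on_submit(page_title, linked_titles):
    # The edit is saved and the user gets a response right away;
    # the link bookkeeping is just enqueued for later.
    job_queue.put(("refresh_links", page_title, linked_titles))

def run_background_jobs():
    # Runs later, outside any web request.
    while not job_queue.empty():
        job, source, targets = job_queue.get()
        if job == "refresh_links":
            for target in targets:
                links_table.setdefault(target, set()).add(source)

on_submit("A", ["B", "C"])
run_background_jobs()
print(links_table.get("B"))   # {'A'}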
You can find more information about this at Help:Job Queue.
Upvotes: 16