lizarisk

Reputation: 7820

How to distinguish a Wikipedia article by URL?

Many pages on Wikipedia are not articles, e.g. talk pages. How can I distinguish them from articles by URL?

Upvotes: 1

Views: 350

Answers (2)

svick

Reputation: 244878

You can get the list of Wikipedia namespaces by using its API with the following query (to also get aliases such as "WP", request siprop=namespaces|namespacealiases):

http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces

Then, if the part of the page title before the first colon matches any of the known namespaces, it's not an article; otherwise, it is.
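As a rough sketch of that check in Python: the helper names below (parse_namespace_names, title_from_url, is_article) are made up for illustration, and the parsing assumes the default format=json response shape, where each namespace's local name sits under the "*" key and aliases come back under "namespacealiases". The custom User-Agent string is also just a placeholder.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def normalize(name):
    # Treat underscores as spaces and ignore case when comparing prefixes.
    return name.replace("_", " ").strip().lower()


def parse_namespace_names(siteinfo):
    """Collect names and aliases of all non-main namespaces (assumed JSON shape)."""
    query = siteinfo["query"]
    names = set()
    for ns in query["namespaces"].values():
        if ns["id"] == 0:
            continue  # the main namespace has an empty name
        names.add(ns["*"])
        if "canonical" in ns:
            names.add(ns["canonical"])
    for alias in query.get("namespacealiases", []):
        if alias["id"] != 0:
            names.add(alias["*"])
    return {normalize(n) for n in names if n}


def fetch_namespace_names():
    """Download the namespace list from the live API (needs network access)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces|namespacealiases",
        "format": "json",
    })
    request = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "namespace-check-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        return parse_namespace_names(json.load(response))


def title_from_url(url):
    """Extract the page title from a /wiki/<title> style URL."""
    path = urllib.parse.urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a /wiki/ URL: " + url)
    return urllib.parse.unquote(path[len(prefix):]).replace("_", " ")


def is_article(title, namespace_names):
    """True if the text before the first colon is not a known namespace."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        return True  # no colon at all, so main namespace
    return normalize(prefix) not in namespace_names
```

You would call fetch_namespace_names() once, cache the result, and then run is_article(title_from_url(url), names) for each URL. Note this only looks at /wiki/ style URLs; index.php?title=... links would need separate handling.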

Upvotes: 1

lambshaanxy

Reputation: 23062

The short answer is that you can't with regexes alone.

The longer answer is that MediaWiki pages are divided into namespaces, which use a colon as the separator, as in "Talk:Foo". A title without a colon is thus definitely in the main (= content) namespace. The problem is that a title with a colon may either be in another namespace or be a content article whose title happens to contain a colon, and since Wikipedia's list of namespaces is long and ever-changing, you can't (or at least shouldn't) hardcode it in a regex.

The correct answer is thus to use the MediaWiki API to iterate/search for articles in the main namespace only.
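For illustration, here is a minimal Python sketch of iterating over main-namespace pages with list=allpages and apnamespace=0. The function names and the User-Agent string are placeholders, and the code assumes the standard "continue" block for paging through results; treat it as a starting point rather than a complete client.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def http_get_json(params):
    """Send one GET request to the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "allpages-example/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_main_namespace_titles(prefix="", batch_size=50, fetch=http_get_json):
    """Yield titles of pages in namespace 0, following API continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 0,  # 0 is the main (content) namespace
        "aplimit": batch_size,
        "format": "json",
    }
    if prefix:
        params["apprefix"] = prefix
    while True:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break  # no more batches
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

For example, itertools.islice(iter_main_namespace_titles(prefix="Star Trek"), 10) would give the first ten matching titles. Redirects are included by default, so filter them out if you only want real articles.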

Upvotes: 2
