theyuv

Reputation: 1624

Extract data from a Wikipedia article

I am trying to extract an organized list of categories and their subcategories from a Wikipedia article: http://en.wikipedia.org/wiki/Outline_of_academic_disciplines. The list doesn't have to be generated dynamically on my site; I am also willing to extract the data manually with the help of a spreadsheet (importxml, importhtml, etc.). However, I have not found an elegant way to do either (spreadsheet extraction or the API) for this article. As you can see by viewing the page source, importhtml with "table" as the query puts all list items into a single cell, and importhtml with "list" as the query doesn't differentiate between lists (i.e., there's no way of knowing which lists are sublists of which categories). Can someone provide some suggestions?

Upvotes: 0

Views: 266

Answers (1)

Aubrey

Reputation: 507

In Wikipedia, "Category" is a specific term. To get the categories of that article via the API, use the following query:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=categories&titles=Outline%20of%20academic%20disciplines
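If you would rather issue that query from a script, here is a minimal Python sketch. It asks for `format=json` instead of `format=xml` because JSON is easier to parse with the standard library. The function names are made up for illustration, and the response layout (`query` → `pages` → `categories`) is what I would expect the API to return, so check it against a real response before relying on it.

```python
import json
from urllib.parse import urlencode, quote
from urllib.request import urlopen

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_categories_url(title):
    """Build the same query as above, but asking for JSON instead of XML."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "categories",
        "titles": title,
    }
    # quote_via=quote encodes spaces as %20, matching the URL shown above.
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def parse_categories(response):
    """Collect category titles from a decoded API response (assumed layout)."""
    names = []
    for page in response.get("query", {}).get("pages", {}).values():
        for category in page.get("categories", []):
            names.append(category["title"])
    return names


def fetch_categories(title):
    """Hypothetical helper: perform the request and return category titles."""
    with urlopen(build_categories_url(title)) as resp:
        return parse_categories(json.load(resp))
```

Calling `fetch_categories("Outline of academic disciplines")` needs network access; the server may also expect a descriptive User-Agent header, which this sketch does not set.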

But, as far as I can tell, what you want is the URLs of all the Wikipedia articles that are listed in that table.

There are several ways to do that. The simplest is to take the wikicode of the article, paste it into a good editor (I recommend Sublime), and use Search & Replace to strip the "[[" and "]]" and to put the following URL in front of every article title:

http://en.wikipedia.org/wiki/

That gives you the whole list of URLs for the articles mentioned on that page. I hope this is what you are looking for (you mention some code, but I can't see any).
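The same Search & Replace idea can be scripted. The sketch below is only an illustration: `extract_links` is a made-up name, and the regular expression handles plain `[[Title]]` and `[[Title|label]]` links, not every wikicode construct. It also records the number of leading `*` characters on each line, on the assumption that the outline is written as nested bullet lists, so you can tell which entries are sub-items of which.

```python
import re
from urllib.parse import quote

WIKI_BASE = "http://en.wikipedia.org/wiki/"

# Matches [[Title]], [[Title|label]] and [[Title#Section|label]];
# group 1 is the target title without the section or label.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return (depth, title, url) for each article link in the wikicode."""
    results = []
    for line in wikitext.splitlines():
        stripped = line.lstrip()
        # Bullet depth = number of leading asterisks (0 if not a list item).
        depth = len(stripped) - len(stripped.lstrip("*"))
        for match in LINK_RE.finditer(line):
            title = match.group(1).strip()
            # Assumption: titles containing ":" are File:/Category: style
            # links rather than articles, so they are skipped.
            if ":" in title:
                continue
            url = WIKI_BASE + quote(title.replace(" ", "_"))
            results.append((depth, title, url))
    return results
```

Pass it the wikicode you copied from the article; entries with a larger depth belong to the nearest preceding entry with a smaller one.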

Upvotes: 0
