mmachenry

Reputation: 1962

MediaWiki API: Get all pages on sublists of lists on Wikipedia?

I am writing an application that needs lists of Wikipedia page titles within a certain category. Some categories work really well for this. For example, Category:English-language_films is a category assigned to about 60k pages. Using the MediaWiki API, I can query with list=categorymembers and get a list of all 60k films.

However, this works much less well for something like hockey players in the NHL. Category:Lists_of_National_Hockey_League_players is about as close as a category gets, but it is a category of list pages. It turns out that the concept of NHL players is stored in lists, not categories, whereas the concept of English-language films is stored as a category.

It's rather difficult to obtain the actual list, because these lists are themselves broken up into several sublists by alphabet or team. It's theoretically possible to screen-scrape the data, but simply taking every Wikipedia page linked from a list page is error-prone.

Is there a straightforward way, using the API, to get the pages that are listed by lists, including expanding sublists? Failing that, is there some way to tell from the content of a list whether a link is a member of the list or just metadata about a member of the list?

Upvotes: 0

Views: 159

Answers (1)

Tgr

Reputation: 28210

When there is a category of lists of things, chances are there is a category of the things themselves as well. In your case that would be Category:National Hockey League players. You can walk it recursively with the categorymembers API. (Unlike lists, categories can't contain red links, so depending on your use case that might be a problem.)
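A minimal sketch of such a recursive walk, assuming the English Wikipedia endpoint and the standard `list=categorymembers` parameters (`cmtitle`, `cmtype`, `cmlimit`, plus the `continue` block for paging). The function names, the `fetch` parameter, the `max_depth` cap, and the User-Agent string are illustrative choices, not part of the API:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with one that identifies your application.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-walk-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, fetch=fetch_json):
    """Yield every direct member of one category, following continuation."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
    }
    while True:
        data = fetch(params)
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}


def walk_category(category, fetch=fetch_json, max_depth=5):
    """Return the article titles found in a category and its subcategories."""
    titles = set()
    seen = {category}  # guards against cycles in the category graph
    stack = [(category, 0)]
    while stack:
        current, depth = stack.pop()
        for member in category_members(current, fetch):
            if member["ns"] == 14:  # namespace 14 is Category
                if depth < max_depth and member["title"] not in seen:
                    seen.add(member["title"])
                    stack.append((member["title"], depth + 1))
            elif member["ns"] == 0:  # namespace 0 is the main article namespace
                titles.add(member["title"])
    return titles
```

Calling `walk_category("Category:National Hockey League players")` would issue live requests. The `seen` set and the depth cap are there because category graphs can contain cycles and can drift into loosely related topics a few levels down, so the right `max_depth` depends on the category.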

Other than that, Wikipedia APIs won't be much help. You can check Wikidata for something appropriate (e.g. data items with the NHL.com player ID property); that's a different data set, but it is sometimes kept in sync and is always easy to query. If that's not appropriate, you'll have to scrape the HTML.
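For the Wikidata route, a rough sketch against the public SPARQL endpoint could look like the following. The property ID is passed in as a parameter: I believe the NHL.com player ID property is P3522, but treat that as an assumption and check it on Wikidata first. The helper names and the User-Agent string are made up for illustration:

```python
import json
import urllib.parse
import urllib.request

SPARQL_URL = "https://query.wikidata.org/sparql"


def build_query(property_id):
    """Build a SPARQL query for items that have the given property,
    together with the title of their English Wikipedia article."""
    return f"""
SELECT DISTINCT ?item ?title WHERE {{
  ?item wdt:{property_id} ?externalId .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?title .
}}
"""


def run_query(query):
    """Send the query to the Wikidata Query Service and decode the JSON reply."""
    url = SPARQL_URL + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # Placeholder User-Agent; replace it with one that identifies your application.
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikidata-list-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def titles_from_results(data):
    """Extract the sorted, de-duplicated article titles from a SPARQL JSON result."""
    return sorted({row["title"]["value"] for row in data["results"]["bindings"]})
```

Usage would be along the lines of `titles_from_results(run_query(build_query("P3522")))`, with the property ID being the assumption noted above. Items that have no English Wikipedia article are dropped by the `schema:about` join, so like the category approach this will not surface red links.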

Upvotes: 0
