EmJ

Reputation: 4608

How to identify Wikipedia categories in Python

I am currently using pywikibot to obtain the categories of a given Wikipedia page (e.g., support-vector machine) as follows.

import pywikibot as pw

print([i.title() for i in list(pw.Page(pw.Site('en'), 'support-vector machine').categories())])

The result I get is:

[
  'Category:All articles with specifically marked weasel-worded phrases',
  'Category:All articles with unsourced statements',
  'Category:Articles with specifically marked weasel-worded phrases from May 2018',
  'Category:Articles with unsourced statements from June 2013',
  'Category:Articles with unsourced statements from March 2017',
  'Category:Articles with unsourced statements from March 2018',
  'Category:CS1 maint: Uses editors parameter',
  'Category:Classification algorithms',
  'Category:Statistical classification',
  'Category:Support vector machines',
  'Category:Wikipedia articles needing clarification from November 2017',
  'Category:Wikipedia articles with BNF identifiers',
  'Category:Wikipedia articles with GND identifiers',
  'Category:Wikipedia articles with LCCN identifiers'
]

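As you can see, the results include a lot of Wikipedia's tracking and maintenance categories, such as 'Category:All articles with unsourced statements' and 'Category:Wikipedia articles with GND identifiers'.

However, the only categories I am interested in are the informative ones: 'Category:Classification algorithms', 'Category:Statistical classification', and 'Category:Support vector machines'.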

I am wondering if there is a way to get all tracking or maintenance Wikipedia categories, so that I can remove them from the results and keep only the informative categories.

Alternatively, please suggest any other way of eliminating them from the results.

I am happy to provide more details if needed.

Upvotes: 3

Views: 1053

Answers (1)

AXO

Reputation: 9086

pywikibot does not currently expose the API option for filtering out hidden categories. You can filter them manually by checking for the hidden key in each category's categoryinfo:

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

gives:

['Category:Classification algorithms', 
 'Category:Statistical classification', 
 'Category:Support vector machines']

See https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories and https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories for more info.
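If you would rather let the server do the filtering, the MediaWiki API's prop=categories query accepts clshow=!hidden, which returns only non-hidden categories. Below is a minimal sketch of that approach using only the standard library instead of pywikibot. The function names, the User-Agent string, and the choice of the English Wikipedia endpoint are my own assumptions for illustration, and it does not follow API continuation, so it assumes the page has no more categories than a single response can hold.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint: the English Wikipedia API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(title):
    """Return API parameters that request only non-hidden categories."""
    return {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "clshow": "!hidden",  # exclude hidden (tracking/maintenance) categories
        "cllimit": "max",     # as many categories as one response allows
        "redirects": 1,       # resolve redirects to the target page
    }


def extract_titles(response):
    """Collect category titles from a decoded API response (a dict)."""
    pages = response.get("query", {}).get("pages", {})
    return [
        cat["title"]
        for page in pages.values()
        for cat in page.get("categories", [])
    ]


def visible_categories(title):
    """Fetch the non-hidden categories of a page in a single request."""
    url = API_URL + "?" + urlencode(build_query(title))
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = Request(url, headers={"User-Agent": "category-filter-example/0.1"})
    with urlopen(request) as reply:
        return extract_titles(json.load(reply))
```

Calling visible_categories('Support-vector machine') should then return the same kind of list as the pywikibot version above, without a separate categoryinfo lookup for each category.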

Upvotes: 3
