fersarr

Reputation: 3529

Wikipedia Category Hierarchy from dumps

Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.

For example, the SQL categories' dump has entries for each category but I can't find anything about how they relate to each other.

The other dump (latest-pages-articles) says the parent categories for each page but in an unordered way. It just states all the parents.

I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?

Upvotes: 21

Views: 10848

Answers (3)

How the data is encoded

You need two tables:

  • page: maps each page_id to its page_title and page_namespace
  • categorylinks: records, for every page or category, which categories it belongs to

The way things are linked up is:

  • categorylinks.cl_from (integer) points to the page.page_id (unique integer) of the child page or category
  • categorylinks.cl_to (text) points to the page.page_title (text, unique together with namespace) of the parent category

Here's a concrete example:

mysql enwiki -e "select page_id, page_namespace, page_title, page_is_redirect from page where page_namespace in (0, 14) and page_title in ('Albert_Einstein')"

gives:

+---------+----------------+-----------------+------------------+
| page_id | page_namespace | page_title      | page_is_redirect |
+---------+----------------+-----------------+------------------+
|     736 |              0 | Albert_Einstein |                0 |
| 2504643 |             14 | Albert_Einstein |                0 |
+---------+----------------+-----------------+------------------+

so we see that there are two pages with that title:

  • an article (namespace 0) with page_id 736
  • a category (namespace 14) with page_id 2504643

The namespace IDs are listed at: https://en.wikipedia.org/wiki/Wikipedia:Namespace

Next:

mysql enwiki -e "select cl_from, cl_to from categorylinks where cl_from in (736, 2504643)"

gives:

+---------+--------------------------------------------------------------------------------------------------------------+                                                                    
| cl_from | cl_to                                                                                                        |                                                                    
+---------+--------------------------------------------------------------------------------------------------------------+                                                                    
|     736 | 1879_births                                                                                                  |                                                                    
|     736 | 1955_deaths                                                                                                  |                                                                    
|     736 | 19th-century_German_Jews                                                                                     |                                                                    
|     736 | 20th-century_American_engineers                                                                              |
...
|     736 | Winners_of_the_Max_Planck_Medal                                                                              |
|     736 | Württemberger_emigrants_to_the_United_States                                                                 |
| 2504643 | Commons_category_link_is_on_Wikidata                                                                         |
| 2504643 | Wikipedia_categories_named_after_American_scientists                                                         |
| 2504643 | Wikipedia_categories_named_after_German_scientists                                                           |
| 2504643 | Wikipedia_categories_named_after_Swiss_people                                                                |
| 2504643 | Wikipedia_categories_named_after_physicists                                                                  |
+---------+--------------------------------------------------------------------------------------------------------------+

therefore, e.g.:

  • the article Albert_Einstein (page_id 736) is in the category 1879_births
  • the category Albert_Einstein (page_id 2504643) is in the category Wikipedia_categories_named_after_physicists

Actually getting the tree

There are many possible approaches to getting the data out; this was my path of least resistance.

MySQL was feeling a bit slower than it should, so I just reverted to my beloved SQLite: How to obtain a list of titles of all Wikipedia articles

PostgreSQL and MySQL have better support for recursive queries (and PostgreSQL for arrays), which would allow for a nicer pure-SQL solution, but a Python loop felt like it would provide greater flexibility anyway, so I just went for that.
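
For reference, here is a minimal sketch (not the approach I actually used) of what a pure-SQL recursive walk could look like with a recursive CTE, against the page/categorylinks SQLite schema built below. SQLite also supports WITH RECURSIVE; the depth limit is what protects against category cycles, and pages reachable through several parents are not deduplicated the way the Python walker below deduplicates them:

#!/usr/bin/env python
# Sketch only: recursive CTE walk of Category:Mathematics in SQLite.
# Assumes the enwiki.sqlite database created by import-sqlite.sh below.
import sqlite3

con = sqlite3.connect('enwiki.sqlite')
for depth, namespace, title in con.execute('''
with recursive tree(page_id, page_namespace, page_title, depth) as (
    select page_id, page_namespace, page_title, 0
    from page
    where page_namespace = 14 and page_title = 'Mathematics'
    union all
    select p.page_id, p.page_namespace, p.page_title, t.depth + 1
    from tree t
    join categorylinks cl on cl.cl_to = t.page_title
    join page p on p.page_id = cl.cl_from
    -- only expand category nodes, and stop at depth 3 to avoid cycles
    where t.page_namespace = 14 and t.depth < 3
)
select depth, page_namespace, page_title from tree
'''):
    print(depth, namespace, title)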

I imported to SQLite with the help of https://github.com/jamesmishra/mysqldump-to-csv:

git clone https://github.com/jamesmishra/mysqldump-to-csv
cd mysqldump-to-csv
git checkout 24301dfa739c13025844ed3ff5a8abe093ced6cc
patch <<'EOF'
diff --git a/mysqldump_to_csv.py b/mysqldump_to_csv.py
index b49cfe7..8d5bb2a 100644
--- a/mysqldump_to_csv.py
+++ b/mysqldump_to_csv.py
@@ -101,7 +101,8 @@ def main():
     # listed in sys.argv[1:]
     # or stdin if no args given.
     try:
-        for line in fileinput.input():
+        sys.stdin.reconfigure(errors='ignore')
+        for line in fileinput.input(encoding="utf-8", errors="ignore"):
             # Look for an INSERT statement and parse it.
             if is_insert(line):
                 values = get_values(line)
EOF

The patch is to overcome binary data madness in categorylinks: https://github.com/jamesmishra/mysqldump-to-csv/issues/17

Then I get the files:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-categorylinks.sql.gz
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.sql.gz

and import into SQLite with:

import-sqlite.sh

db=enwiki.sqlite
rm -f "$db"

# page
sqlite3 "$db" 'create table "page"("page_id" integer, "page_namespace" integer, "page_title" text, "page_is_redirect" integer, "page_len" integer)'
time zcat enwiki-latest-page.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | csvtool col 1,2,3,4,10 - | sqlite3 "$db" ".import --csv '|cat -' page"
du -h "$db"
time sqlite3 "$db" 'create unique index "page_id" on "page"("page_id")'
du -h "$db"
time sqlite3 "$db" 'create index "page_namespace_title" on "page"("page_namespace", "page_title")'
du -h "$db"

# categorylinks
sqlite3 "$db" 'create table categorylinks("cl_from" integer, "cl_to" text)'
time zcat enwiki-latest-categorylinks.sql.gz | python mysqldump-to-csv/mysqldump_to_csv.py | csvtool col 1,2 - | sqlite3 "$db" ".import --csv '|cat -' categorylinks"
du -h "$db"
time sqlite3 "$db" 'create index "categorylinks_to" on categorylinks("cl_to")'
du -h "$db"

On my Lenovo ThinkPad P51:

+ sqlite3 enwiki.sqlite '.import --csv '\''|cat -'\'' page'

real    6m25.156s
user    9m57.253s
sys     0m25.416s
+ du -h enwiki.sqlite
1.6G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create unique index "page_id" on "page"("page_id")'

real    1m6.278s
user    0m28.639s
sys     0m35.041s
+ du -h enwiki.sqlite
2.0G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create index "page_namespace_title" on "page"("page_namespace", "page_title")'

real    2m5.788s
user    1m18.303s
sys     0m25.823s
+ du -h enwiki.sqlite
3.0G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create table categorylinks("cl_from" integer, "cl_to" text)'
+ zcat enwiki-latest-categorylinks.sql.gz
+ python mysqldump-to-csv/mysqldump_to_csv.py
+ csvtool col 1,2 -
+ sqlite3 enwiki.sqlite '.import --csv '\''|cat -'\'' categorylinks'

real    20m30.865s
user    29m13.953s
sys     1m6.885s
+ du -h enwiki.sqlite
6.8G    enwiki.sqlite
+ sqlite3 enwiki.sqlite 'create index "categorylinks_to" on categorylinks("cl_to")'

real    6m36.867s
user    3m12.710s
sys     2m13.543s
+ du -h enwiki.sqlite
8.1G    enwiki.sqlite
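
To sanity check the import, a small sketch with Python's sqlite3 module can reproduce the Albert_Einstein example from the MySQL session above against enwiki.sqlite:

#!/usr/bin/env python
# Sketch: check the imported data by listing the Albert_Einstein article and
# category together with their parent categories.
import sqlite3

con = sqlite3.connect('enwiki.sqlite')
pages = con.execute(
    "select page_id, page_namespace, page_title from page "
    "where page_namespace in (0, 14) and page_title = 'Albert_Einstein'"
).fetchall()
for page_id, namespace, title in pages:
    print(page_id, namespace, title)
    # Parent categories of this page/category.
    for (cl_to,) in con.execute(
        'select cl_to from categorylinks where cl_from = ? order by cl_to',
        (page_id,),
    ):
        print('   ', cl_to)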

Then I walk the category tree preorder depth first with this script and output a few graph representations:

sqlite_preorder.py

#!/usr/bin/env python

import argparse
import html
import sys
import sqlite3
import re
import os.path
from pathlib import Path

def escape_dot(s):
    return s.replace('"', '\\"')

NAMESPACE_TO_TEXT = {
    0: '',
    14: 'Category:',
}

def ns_to_txt(ns):
    if ns in NAMESPACE_TO_TEXT:
        return NAMESPACE_TO_TEXT[ns]
    else:
        return str(ns)

parser = argparse.ArgumentParser()
parser.add_argument('-d', '--depth', type=int)
parser.add_argument('db')
parser.add_argument('titles', nargs='+')
args = parser.parse_args()
out_dot = True
out_txt = True
out_html = True
outdir = 'out'

con = sqlite3.connect(args.db)
cur = con.cursor()
todo = list(map(lambda title: (14, title, 0), args.titles.copy()))
main_title = args.titles[0]
main_title_human = main_title.replace('_', ' ') + ' - Wikipedia CatTree'
basename = os.path.join(outdir, main_title)
visited = set()
Path(outdir).mkdir(parents=True, exist_ok=True)
if out_txt:
    out_txt_f = open(f'{basename}.txt', 'w')
if out_dot:
    out_dot_f = open(f'{basename}.dot', 'w')
    out_dot_f.write('digraph {\n')
if out_html:
    out_html_f = open(f'{basename}.html', 'w')
    out_html_f.write(f'''<!doctype html>
<html lang=en>
<head>
<meta charset=utf-8>
<title>{main_title_human}</title>
<style>
a {{ text-decoration: none; }}
details {{ margin-left: 1em; }}
summary {{ margin-bottom: 0.4em; }}
</style>
</head>
<body>
<h1>{main_title_human}</h1>
''')
last_depth = 0
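# Preorder depth-first walk of the category tree, using todo as an explicit stack.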
while len(todo):
    namespace, title, depth = todo.pop()
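    # When the walk moves back up (or stays at the same depth), close the
    # corresponding number of still-open <details> elements in the HTML output.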
    depth_delta = depth - last_depth
    if depth_delta <= 0:
        repeat_close = -depth_delta + 1
    else:
        repeat_close = 0
    last_depth = depth
    path_last = ns_to_txt(namespace) + title
    if out_txt:
        out_txt_f.write('{}{} {}\n'.format(' ' * depth, depth, path_last))
    if out_html:
        out_html_f.write('</details>\n' * repeat_close)
        out_html_f.write(f'<details open="true"><summary><a href="https://en.wikipedia.org/wiki/{html.escape(path_last)}">{html.escape(path_last.replace("_", " "))}</a></summary>\n')
    visited.add((namespace, title))
    print(len(visited), file=sys.stderr)
    if namespace == 14:
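        # Only categories (namespace 14) can have children via categorylinks
        # (cl_to is always a category title). page_title desc because children
        # are pushed onto the stack and popped in reverse, so the final output
        # comes out in ascending title order.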
        for child_namespace, child_title in cur.execute('''
select page_namespace, page_title from categorylinks
inner join page on cl_from = page_id and cl_to = ?
order by page_namespace asc, page_title desc
''', (title,)).fetchall():
            if not (child_namespace, child_title) in visited and not (args.depth is not None and depth == args.depth):
                if out_dot:
                    out_dot_f.write('"{}{}"->"{}{}";\n'.format(ns_to_txt(namespace), escape_dot(title), ns_to_txt(child_namespace), escape_dot(child_title)))
                todo.append((child_namespace, child_title, depth + 1))
if out_txt:
    out_txt_f.close()
if out_dot:
    out_dot_f.write('}\n')
    out_dot_f.close()
if out_html:
    out_html_f.write('</details>\n' * last_depth)
    out_html_f.write('''</body>
</html>
''')
    out_html_f.close()

Sample usage, walking down from https://en.wikipedia.org/wiki/Category:Mathematics up to 6 levels deep:

time ./sqlite_preorder.py -d6 enwiki.sqlite Mathematics

This walked 82460 pages in 2m30.895s.

It produces the following output files illustrating some basic visualization ideas:

  • out/Mathematics.html: collapsible table of contents with details and summary HTML elements and clickable links to Wikipedia. I'm quite proud of this one! Live demo: https://cirosantilli.com/wikipedia-cattree/Mathematics

  • out/Mathematics.txt: simple plaintext tree, e.g.:

    0 Category:Mathematics
    1 Category:Fields_of_mathematics
      2 Category:Algebra
       3 Category:Abstract_algebra
        4 Category:Abstract_algebra_stubs
         5 Category:Algebraic_geometry_stubs
          6 10Algebraic-geometry-stub
          6 3-fold
          6 Abelian_surface
          6 Abhyankar–Moh_theorem
          6 Abundance_conjecture
    
  • out/Mathematics.dot: Graphviz format, which is understood by a few different applications

Tested on Ubuntu 23.04.

Upvotes: 1

amirouche

Reputation: 7873

Loading the dumps of category links etc. to build a Wikipedia hierarchy takes a very long time (even if it is interesting).

I found a faster path that gives good results: I rely on the Wikipedia vital articles hierarchy. See, for instance, sensimark for an example use.

Upvotes: 2

svick

Reputation: 244757

The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.

You're also going to need the page (not pages-articles) dump for the page ID to title mapping.

Upvotes: 19
