Val

Reputation: 1832

Get list of nouns and declension of a word from Wiktionary API

I would like to get:

  1. a list of nouns in a specific language
  2. the case declension table for a word in a Slavic language.

I was hoping to be able to send something like an HTTP GET request with parameters:

https://cs.wiktionary.org/wiki/wordlist/nouns

For the declension table I was also hoping for an HTTP request that returns a JSON object, e.g.:

https://cs.wiktionary.org/wiki/program/declension

Expected response:

{
    "word": "program",
    "singular_declension": {
        "nominative": "program",
        "genitive": "programu",
        "dative": "programu",
        "accusative": "program",
        "vocative": "programe",
        "locative": "programu",
        "instrumental": "programem"
    },
    "plural_declension": {
        "nominative": "programy",
        "genitive": "programů",
        "dative": "programům",
        "accusative": "programy",
        "vocative": "programy",
        "locative": "programech",
        "instrumental": "programy"
    }
}

Unfortunately, I cannot find any endpoints for that in the official API specs or the documentation: https://www.mediawiki.org/wiki/API:Main_page

How can I get those results? Or do I have to resort to web scraping and extract this information from the HTML pages?

Upvotes: 1

Views: 331

Answers (1)

dimakin

Reputation: 1948

Action API

Technically, it's possible to use the Action API as follows:

  1. find the relevant category (Česká substantiva?)
  2. query all the members of that category chunk by chunk (list=categorymembers)
  3. query the sections of each page (action=parse with prop=sections)
  4. find the declension section by keyword (skloňování?)
  5. query the text of that section (action=parse with prop=text and section)
  6. finally, parse the HTML containing the desired table.

However, this approach scales poorly: it takes at least two requests per word, plus HTML parsing.
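Step 2 of the list above can be sketched as follows. This is a minimal illustration, not a tested client: the helper names fetch_json and iter_category_members, the User-Agent string, and the use of urllib instead of requests are my own choices, and the category title is only the guess from step 1.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://cs.wiktionary.org/w/api.php"


def fetch_json(params, url=API_URL):
    """Send one GET request to the Action API and decode the JSON reply."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(
        f"{url}?{query}",
        headers={"User-Agent": "declension-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_category_members(category, fetch=fetch_json):
    """Yield page titles in `category`, following continuation tokens."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": 0,   # main namespace only, skips subcategories
        "cmlimit": "max",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

Calling iter_category_members("Category:Česká substantiva") would then walk the whole category one chunk at a time, assuming that category name is the right one. The fetch argument exists only so the loop can be exercised without network access.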

English Wiktionary

It is hard to verify, but English Wiktionary seems to contain most of the lemmas of other languages. The Czech word program is there too, with its declension table. Furthermore, the section name is the same across all languages (Declension), and the HTML table carries a lot of metadata, which makes it much easier to parse. An example of a word form:

<span class="Latn form-of lang-lv acc|s-form-of" lang="lv"><a...>māsu</a></span>

Before implementing a parser, though, it is usually a good idea to check for existing solutions.

Wiktextract and Kaikki

These two projects are both based on English Wiktionary dumps. Wiktextract provides a CLI and a Python interface for such dumps. Kaikki hosts a recent extracted version and provides an HTTP interface to it. Here are a few examples:

All Czech nouns (be careful, it's 46.6 MB):

https://kaikki.org/dictionary/Czech/by-pos-noun/kaikki_dot_org-dictionary-Czech-by-pos-noun.json
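A file of that size is easier to handle as a stream. The sketch below assumes the download is in JSON Lines form (one JSON object per line) and that each entry carries "word" and "pos" keys; both are assumptions about the Kaikki data, and iter_entries and noun_lemmas are hypothetical helper names.

```python
import json


def iter_entries(lines):
    """Yield one decoded entry per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def noun_lemmas(lines):
    """Collect the distinct `word` values of entries whose `pos` is "noun"."""
    return sorted(
        {entry["word"]
         for entry in iter_entries(lines)
         if entry.get("pos") == "noun" and "word" in entry}
    )
```

With the file saved locally, passing the open file object (opened with encoding="utf-8") to noun_lemmas reads it line by line instead of loading everything at once.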

program description:

https://kaikki.org/dictionary/Czech/meaning/p/pr/program.json

The structure differs slightly from the desired one, but is easily convertible:

...
{"form": "programům", "tags": ["dative", "plural"], "source":  "declension"}, 
{"form": "program", "tags": ["accusative", "singular"], "source": "declension"}
...
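The conversion could look roughly like this. It is a sketch that relies only on the "form", "tags" and "source" keys visible in the excerpt above; the top-level "word" and "forms" keys and the function name to_declension are assumptions.

```python
CASES = ("nominative", "genitive", "dative", "accusative",
         "vocative", "locative", "instrumental")


def to_declension(entry):
    """Reshape a Kaikki-style entry into the layout asked for in the question."""
    result = {
        "word": entry["word"],
        "singular_declension": {},
        "plural_declension": {},
    }
    for form in entry.get("forms", []):
        if form.get("source") != "declension":
            continue
        tags = set(form.get("tags", []))
        for number in ("singular", "plural"):
            if number not in tags:
                continue
            for case in CASES:
                if case in tags:
                    # Keep the first form when a case has several variants.
                    result[f"{number}_declension"].setdefault(case, form["form"])
    return result
```

Forms without a case tag or a number tag are skipped, so cases missing from the data are simply absent from the result rather than set to a placeholder.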

Sample code

If the above projects cannot be used for some reason, here is some sample parsing code in Python.

First, we need to detect the language sections, and a lookup dictionary seems to be a good option:

import iso639

languages = {}
for line in iso639.data:
    if not line['iso639_1']:
        continue
        
    names = line['name'].split(';')
    for name in names:
        # handle a few exceptions by dropping the parenthesized part:
        # * Interlingua (International Auxiliary Language Association)
        # * Occitan (post 1500)
        # * Tonga (Tonga Islands)
        name, *_ = name.split('(', maxsplit=1)
        name = name.strip()
        languages[name] = line['iso639_1']

# Serbo-Croatian doesn't exist in ISO, but is used in Wikipedia & Wiktionary 
languages['Serbo-Croatian'] = 'sh'

Section finder:

def get_section_id(data: dict, keyword: str, language: str):
    """Return the index of the `keyword` section under `language`, or None."""
    try:
        sections = data['parse']['sections']
    except (KeyError, TypeError):
        return None
    
    section_language = None
    for section in sections:
        if section['line'] in languages:
            section_language = section['line']
            continue

        if section['line'] == keyword and section_language == language:
            return int(section['index'])
    
    return None

Processing:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wiktionary.org/w/api.php'

word = 'māsa'

params = {
    'action': 'parse',
    'format': 'json',
    'page': word,
    'prop': 'sections',
    'disabletoc': True
}

response = requests.get(url, params=params)

section_id = get_section_id(response.json(), 'Declension', 'Latvian')

params = {
    'action': 'parse',
    'format': 'json',
    'page': word,
    'prop': 'text',
    'section': section_id,
    'disabletoc': True
}

response = requests.get(url, params=params)
# raises KeyError if the word does not exist: the reply then has no 'parse' key
html = response.json()['parse']['text']['*']

soup = BeautifulSoup(html, 'html.parser')
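The failure case mentioned in the comment above can be made explicit. When a page is missing, the Action API replies with an "error" object (with "code" and "info" fields) instead of "parse". A small guard, with the hypothetical names WiktionaryError and extract_html, might look like this:

```python
class WiktionaryError(Exception):
    """Raised when the API reply does not contain parsed HTML."""


def extract_html(payload):
    """Return the rendered HTML from an `action=parse` reply, or raise."""
    if "error" in payload:
        error = payload["error"]
        raise WiktionaryError(f"{error.get('code')}: {error.get('info')}")
    try:
        return payload["parse"]["text"]["*"]
    except (KeyError, TypeError) as exc:
        raise WiktionaryError("unexpected response shape") from exc
```

It would replace the direct indexing, as in html = extract_html(response.json()). It is also worth checking that section_id is not None before the second request: requests leaves out parameters whose value is None, so the API would return the whole page rather than one section.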

Result:

result = {
    "word": word,
    "singular_declension": {
        "nominative": soup.find("span", class_="nom|s-form-of").text,
        "genitive": soup.find("span", class_="gen|s-form-of").text,
        "dative": soup.find("span", class_="dat|s-form-of").text,
        "accusative": soup.find("span", class_="acc|s-form-of").text,
        "vocative": soup.find("span", class_="voc|s-form-of").text,
        "locative": soup.find("span", class_="loc|s-form-of").text,
        "instrumental": soup.find("span", class_="ins|s-form-of").text
    },
    "plural_declension": {
        "nominative": soup.find("span", class_="nom|p-form-of").text,
        "genitive": soup.find("span", class_="gen|p-form-of").text,
        "dative": soup.find("span", class_="dat|p-form-of").text,
        "accusative": soup.find("span", class_="acc|p-form-of").text,
        "vocative": soup.find("span", class_="voc|p-form-of").text,
        "locative": soup.find("span", class_="loc|p-form-of").text,
        "instrumental": soup.find("span", class_="ins|p-form-of").text
    }
}

print(result)

Output:

{'word': 'māsa',
 'singular_declension': {'nominative': 'māsa', 'genitive': 'māsas',
                         'dative': 'māsai', 'accusative': 'māsu',
                         'vocative': 'māsa', 'locative': 'māsā',
                         'instrumental': 'māsu'},
 'plural_declension': {'nominative': 'māsas', 'genitive': 'māsu',
                       'dative': 'māsām', 'accusative': 'māsas',
                       'vocative': 'māsas', 'locative': 'māsās',
                       'instrumental': 'māsām'}}
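The literal above calls .text on every soup.find(...) result, so a single missing form (a language without a vocative, for example) raises AttributeError. A more tolerant variant can loop over the case and number codes. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and FormCollector and extract_declension are hypothetical names.

```python
from html.parser import HTMLParser

CASES = {"nom": "nominative", "gen": "genitive", "dat": "dative",
         "acc": "accusative", "voc": "vocative", "loc": "locative",
         "ins": "instrumental"}
NUMBERS = {"s": "singular_declension", "p": "plural_declension"}
SUFFIX = "-form-of"


class FormCollector(HTMLParser):
    """Collect the text of spans with a "<case>|<number>-form-of" class."""

    def __init__(self):
        super().__init__()
        self.forms = {}    # (case, number) -> first form seen
        self._key = None   # key of the span currently being read
        self._depth = 0    # nested <span> depth inside that span
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._key is not None:
            self._depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            case, _, rest = cls.partition("|")
            if not rest.endswith(SUFFIX):
                continue
            number = rest[:-len(SUFFIX)]
            if case in CASES and number in NUMBERS:
                self._key, self._depth, self._text = (case, number), 0, []
                return

    def handle_data(self, data):
        if self._key is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._key is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Like soup.find, keep only the first match for each key.
        self.forms.setdefault(self._key, "".join(self._text).strip())
        self._key = None


def extract_declension(word, html):
    """Build the result dict; forms missing from the HTML become None."""
    collector = FormCollector()
    collector.feed(html)
    result = {"word": word}
    for number, field in NUMBERS.items():
        result[field] = {name: collector.forms.get((case, number))
                         for case, name in CASES.items()}
    return result
```

extract_declension(word, html) takes the same html string as above and yields the same dictionary layout, with None in place of any form the table does not contain.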

Upvotes: 1
