Reputation: 1832
I would like to get two things: a word list and the declension table for a given word.
For the word list, I was hoping to be able to send something like an HTTP GET request with parameters:
https://cs.wiktionary.org/wiki/wordlist/nouns
And for the declension table I was also hoping for an http request and then a response as a JSON object, eg.:
https://cs.wiktionary.org/wiki/program/declension
Expected response:
{
  "word": "program",
  "singular_declension": {
    "nominative": "program",
    "genitive": "programu",
    "dative": "programu",
    "accusative": "program",
    "vocative": "programe",
    "locative": "programu",
    "instrumental": "programem"
  },
  "plural_declension": {
    "nominative": "programy",
    "genitive": "programů",
    "dative": "programům",
    "accusative": "programy",
    "vocative": "programy",
    "locative": "programech",
    "instrumental": "programy"
  }
}
Unfortunately, I cannot find any endpoints for that in the official API documentation: https://www.mediawiki.org/wiki/API:Main_page
How can I get those results? Or do I have to resort to web scraping and extracting this info from the HTML pages?
Upvotes: 1
Views: 331
Reputation: 1948
Technically, it's possible to use the Action API to fetch the HTML containing the desired table, yet this way is hardly scalable.
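As a rough sketch (the page name "program" and the parameters here are just illustrative), the request to the Czech Wiktionary's Action API could be built like this; action=parse with prop=text returns the rendered HTML of the entry, which includes the declension table:

```python
from urllib.parse import urlencode

# Build the Action API URL (not sent here; fetching it returns JSON whose
# ['parse']['text']['*'] field is the rendered HTML of the whole entry).
api = 'https://cs.wiktionary.org/w/api.php'
params = {'action': 'parse', 'format': 'json', 'page': 'program', 'prop': 'text'}
url = api + '?' + urlencode(params)
print(url)
```

You would then still have to parse the declension table out of that HTML, separately for every word and every wiki.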
It's hard to claim with certainty, but English Wiktionary seems to contain most of the lemmas of other languages. program in Czech is also there, with declensions. Furthermore, the section name (Declension) is standard across all languages, and the HTML table has lots of metadata inside, which makes it much easier to parse. As an example of a word form:
<span class="Latn form-of lang-lv acc|s-form-of" lang="lv"><a...>māsu</a></span>
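The case and number can be read straight off that class attribute. A minimal sketch with BeautifulSoup, assuming the "<case>|<number>-form-of" class convention shown above is stable:

```python
from bs4 import BeautifulSoup

# The sample span from above; the "acc|s-form-of" class encodes
# accusative case, singular number.
html = '<span class="Latn form-of lang-lv acc|s-form-of" lang="lv"><a>māsu</a></span>'
span = BeautifulSoup(html, 'html.parser').span

# Pick out the metadata-bearing class and split it into case and number.
form_class = next(c for c in span['class'] if c.endswith('-form-of') and '|' in c)
case, number = form_class[:-len('-form-of')].split('|')
print(case, number, span.text)  # acc s māsu
```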
Yet before implementing parsing, it is usually a good idea to check for existing solutions. Two such projects, Wiktextract and Kaikki, are both based on English Wiktionary dumps. Wiktextract provides a CLI and a Python interface for those dumps; Kaikki hosts the most recently extracted version and provides an HTTP interface to it. Here are a few examples:
All Czech nouns (be careful, it's 46.6 MB):
https://kaikki.org/dictionary/Czech/by-pos-noun/kaikki_dot_org-dictionary-Czech-by-pos-noun.json
The description of program:
https://kaikki.org/dictionary/Czech/meaning/p/pr/program.json
It has a slightly different structure from the desired one, but it is easily convertible:
...
{"form": "programům", "tags": ["dative", "plural"], "source": "declension"},
{"form": "program", "tags": ["accusative", "singular"], "source": "declension"}
...
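A minimal conversion sketch, assuming every declension entry carries exactly one case tag plus "singular" or "plural" in its "tags" list (as in the two entries above):

```python
# Sample "forms" entries as extracted by Wiktextract/Kaikki (from above).
forms = [
    {"form": "programům", "tags": ["dative", "plural"], "source": "declension"},
    {"form": "program", "tags": ["accusative", "singular"], "source": "declension"},
]

result = {"word": "program", "singular_declension": {}, "plural_declension": {}}
for entry in forms:
    if entry.get("source") != "declension":
        continue  # skip non-declension forms
    number = "singular" if "singular" in entry["tags"] else "plural"
    case = next(t for t in entry["tags"] if t not in ("singular", "plural"))
    result[number + "_declension"][case] = entry["form"]
print(result)
```

Run over the full list of forms, this fills in all seven cases for both numbers.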
If the above projects somehow cannot be used, here is some sample parsing code in Python.
First, we need to somehow detect language sections, and a lookup dictionary seems to be a good option:
import iso639

languages = {}
for line in iso639.data:
    if not line['iso639_1']:
        continue
    names = line['name'].split(';')
    for name in names:
        # handle a few exceptions:
        # * Interlingua (International Auxiliary Language Association)
        # * Occitan (post 1500)
        # * Tonga (Tonga Islands)
        name, *_ = name.split('(', maxsplit=1)
        name = name.strip()
        languages[name] = line['iso639_1']

# Serbo-Croatian doesn't exist in ISO, but is used in Wikipedia & Wiktionary
languages['Serbo-Croatian'] = 'sh'
Section finder:
def get_section_id(data: dict, keyword: str, language: str):
    try:
        sections = data['parse']['sections']
    except (KeyError, TypeError):
        return None
    section_language = None
    for section in sections:
        if section['line'] in languages:
            section_language = section['line']
            continue
        if section['line'] == keyword and section_language == language:
            return int(section['index'])
    return None
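To illustrate the traversal, here is a small self-contained run against a hand-written response (an assumption: the entries mirror what action=parse&prop=sections actually returns, with 'line' and 'index' keys); the function is repeated so the snippet runs on its own:

```python
languages = {'Latvian': 'lv'}  # stand-in for the full lookup built above

def get_section_id(data: dict, keyword: str, language: str):
    try:
        sections = data['parse']['sections']
    except (KeyError, TypeError):
        return None
    section_language = None
    for section in sections:
        # remember the most recent language heading we passed
        if section['line'] in languages:
            section_language = section['line']
            continue
        # match the keyword only inside the requested language's section
        if section['line'] == keyword and section_language == language:
            return int(section['index'])
    return None

mock = {'parse': {'sections': [
    {'line': 'Latvian', 'index': '1'},
    {'line': 'Noun', 'index': '2'},
    {'line': 'Declension', 'index': '3'},
]}}
print(get_section_id(mock, 'Declension', 'Latvian'))  # 3
print(get_section_id({}, 'Declension', 'Latvian'))    # None
```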
Processing:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wiktionary.org/w/api.php'
word = 'māsa'

params = {
    'action': 'parse',
    'format': 'json',
    'page': word,
    'prop': 'sections',
    'disabletoc': True,
}
response = requests.get(url, params=params)
section_id = get_section_id(response.json(), 'Declension', 'Latvian')

params = {
    'action': 'parse',
    'format': 'json',
    'page': word,
    'prop': 'text',
    'section': section_id,
    'disabletoc': True,
}
response = requests.get(url, params=params)
# the word may not exist, so this may raise exceptions
html = response.json()['parse']['text']['*']
soup = BeautifulSoup(html, 'html.parser')
Result:
result = {
    "word": word,
    "singular_declension": {
        "nominative": soup.find("span", class_="nom|s-form-of").text,
        "genitive": soup.find("span", class_="gen|s-form-of").text,
        "dative": soup.find("span", class_="dat|s-form-of").text,
        "accusative": soup.find("span", class_="acc|s-form-of").text,
        "vocative": soup.find("span", class_="voc|s-form-of").text,
        "locative": soup.find("span", class_="loc|s-form-of").text,
        "instrumental": soup.find("span", class_="ins|s-form-of").text,
    },
    "plural_declension": {
        "nominative": soup.find("span", class_="nom|p-form-of").text,
        "genitive": soup.find("span", class_="gen|p-form-of").text,
        "dative": soup.find("span", class_="dat|p-form-of").text,
        "accusative": soup.find("span", class_="acc|p-form-of").text,
        "vocative": soup.find("span", class_="voc|p-form-of").text,
        "locative": soup.find("span", class_="loc|p-form-of").text,
        "instrumental": soup.find("span", class_="ins|p-form-of").text,
    },
}
print(result)
{'word': 'māsa',
'singular_declension': {'nominative': 'māsa', 'genitive': 'māsas',
'dative': 'māsai', 'accusative': 'māsu',
'vocative': 'māsa', 'locative': 'māsā',
'instrumental': 'māsu'},
'plural_declension': {'nominative': 'māsas', 'genitive': 'māsu',
'dative': 'māsām', 'accusative': 'māsas',
'vocative': 'māsas', 'locative': 'māsās',
'instrumental': 'māsām'}}
Upvotes: 1