amin

Reputation: 445

Extract parallel text from Wikipedia dumps

For my research project I need to extract parallel documents from Wikipedia dumps. I have downloaded the English and Italian Wikipedia dumps. Now I want to parse them and, for each article in the English dump, find its counterpart in the Italian dump (via the interlanguage links) and store both in the same file for cross-lingual text processing afterward.

I searched for this but couldn't find any code for the purpose. Since I have seen many papers whose authors have done the same thing, I thought it would be worth asking before reinventing the wheel.

Any ideas are appreciated.

Thank you.

Upvotes: 4

Views: 391

Answers (1)

Chiawen

Reputation: 11789

Use the Wikipedia API with action=query&prop=langlinks

Example: https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&lllang=it&titles=Calculus|Bread|Biology

The response gives the title of the corresponding Italian article for each requested page.
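
A minimal Python sketch of how such a query might be built and its result turned into an English-to-Italian title mapping. The function names (`build_langlinks_url`, `parse_langlinks`, `fetch_langlinks`) and the User-Agent string are made up for illustration, and the parser assumes the JSON layout the API usually returns (`query.pages`, each page carrying a `langlinks` list with the target title under `*` or `title`), so check it against a real response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_langlinks_url(titles, target_lang="it"):
    """Build a langlinks query URL for a batch of English titles."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "lllang": target_lang,        # only return links to this language
        "titles": "|".join(titles),   # several titles per request
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_langlinks(response):
    """Map each source title to its target-language title.

    Assumes the layout described above; pages without a langlink
    (missing pages, untranslated articles) are skipped.
    """
    pages = response.get("query", {}).get("pages", {})
    if isinstance(pages, dict):       # older format: dict keyed by page id
        pages = pages.values()
    mapping = {}
    for page in pages:
        links = page.get("langlinks", [])
        if links:
            link = links[0]
            mapping[page["title"]] = link.get("title") or link.get("*")
    return mapping


def fetch_langlinks(titles, target_lang="it"):
    """Query the live API (needs network access)."""
    request = Request(
        build_langlinks_url(titles, target_lang),
        headers={"User-Agent": "parallel-text-sketch/0.1"},  # placeholder
    )
    with urlopen(request) as resp:
        return parse_langlinks(json.load(resp))
```

Calling `fetch_langlinks(["Calculus", "Bread", "Biology"])` should return a dict from English titles to Italian titles, which can then be used to pair up articles from the two dumps. For a full dump the titles would need to be sent in batches, since the API limits how many can go in one request.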

Upvotes: 1
