Javad

Reputation: 313

Article summarization in Wikipedia

Wikipedia shows a short summary for most articles when you use its search feature (see screenshot below). I have looked at various articles and couldn't find this text in the original article, neither in the rendered page nor in the metadata inside the Edit section.

Now, I have two questions:

  1. How does Wikipedia produce these summaries? Are they pre-curated texts entered by the community, or is there an underlying ML algorithm that summarizes articles? In the case of the former, can you point me to where this data is sourced from? In the case of the latter, has the algorithm been open-sourced?

  2. Does the Wikipedia API support retrieving these summaries for a given article?

enter image description here

Upvotes: 1

Views: 896

Answers (5)

A Wikipedia Editor

Reputation: 11

This is an old question, but the prior answers are no longer fully correct. The history and the current situation are complicated.

Wikidata contains descriptions of Wikidata items. The Wikidata items often link to Wikipedia articles on approximately the same topic. Some developers/managers made a well-intentioned but ill-informed decision to grab Wikidata's internal item descriptions and slap them onto Wikipedia as if they were article descriptions. They didn't bother to check with the Wikipedia community whether there were any problems with the idea. In most cases the result looked right, but in other cases it was unfixably wrong. It raised other problems as well.

This was fixed on English Wikipedia. You will now usually find the description stored in the page itself, right near the top, when you click the Edit link. It is stored in a {{short description|blah blah blah}} template. In some cases another template might generate a provisional automatic description. In that case, or if there is no description yet, you can add {{short description|blah blah blah}} at the top of the article. A human-written description takes precedence, overriding any auto-generated description.
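If you already have an article's wikitext, the template described above can be picked out with a regular expression. This is only a rough sketch: the function name is made up for illustration, and it does not see descriptions that are generated by other templates.

```python
import re
from typing import Optional

# Matches {{short description|TEXT}} (case-insensitive) and captures TEXT.
# Any further |parameters after the description text are skipped.
_SHORT_DESC = re.compile(
    r"\{\{\s*short description\s*\|\s*([^|{}]*?)\s*(?:\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def extract_short_description(wikitext: str) -> Optional[str]:
    """Return the text of the first short description template, or None."""
    match = _SHORT_DESC.search(wikitext)
    if match is None:
        return None
    # An empty template body is treated the same as a missing description.
    return match.group(1) or None


sample = "{{Short description|Italian dish}}\n'''Pizza''' is a dish."
print(extract_short_description(sample))
```

A None result only means there is no hand-written template in the wikitext; the page may still show an automatically generated description.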

Note that, as of today, the problem has not been fixed on other-language Wikipedias. There is no current active plan to fix it elsewhere. The reasons are too long, and too off topic, to explain here. Hopefully it will get done at some point. On other-language Wikipedias you currently still need to use the Wikidata API as explained in other replies. Just beware that these aren't actually descriptions of articles, which may result in odd or inaccurate descriptions.
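One way to check where a given description comes from is the MediaWiki Action API's prop=description query, which returns the description together with a descriptionsource field ("local" for one stored on the wiki page, "central" for one taken from Wikidata). A minimal sketch follows; the helper names and the User-Agent string are placeholders of mine, not anything official.

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder value; replace it with something that identifies your own tool.
USER_AGENT = "short-description-example/0.1 (you@example.com)"


def description_url(title: str, lang: str = "en") -> str:
    """Build an Action API query URL asking for a page's short description."""
    params = {
        "action": "query",
        "prop": "description",
        "titles": title,
        "format": "json",
        "formatversion": "2",  # makes query.pages a list
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def parse_description(payload: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (description, descriptionsource) for the first page, if any."""
    pages = payload.get("query", {}).get("pages", [])
    if not pages:
        return None, None
    page = pages[0]
    return page.get("description"), page.get("descriptionsource")


def fetch_description(title: str, lang: str = "en") -> Tuple[Optional[str], Optional[str]]:
    """Perform the HTTP request; needs network access."""
    request = Request(description_url(title, lang), headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return parse_description(json.load(response))
```

Calling fetch_description("Pizza") or fetch_description("Pizza", "de") then shows both the text and whether it is a local or a Wikidata description.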

Ironic note: In practice, Wikidata pretty much uses English as the One True Language for defining concepts in the universe. If some concept in some language doesn't perfectly align with English, Wikidata pretty much treats the other language as irrelevant or wrong. This means Wikidata descriptions have the highest likelihood of being accurate on English Wikipedia, with a higher rate of broken descriptions in other languages. They only fixed the language that least needed to be fixed. Because reasons.

Upvotes: 1

Martin Majlis

Reputation: 373

To solve this problem, you can use Wikipedia-API and nltk.

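import wikipediaapi
from nltk.tokenize import sent_tokenize  # needs NLTK's Punkt tokenizer data to be installed

# Recent Wikipedia-API releases require a user agent; the value here is a placeholder.
wiki = wikipediaapi.Wikipedia(user_agent='summary-example (you@example.com)', language='en')
pizza = wiki.page('Pizza')
print(pizza.fullurl)
# pizza.summary is the article's introduction as plain text
print("Summary length: %d" % len(pizza.summary))

# You can either pick the first N characters or use a sentence tokenizer
sentences = sent_tokenize(pizza.summary)
print("Number of sentences: %d" % len(sentences))
print(sentences[0])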

Output:

https://en.wikipedia.org/wiki/Pizza
Summary length: 1690
Number of sentences: 16
Pizza is a traditional Italian dish consisting of a yeasted flatbread typically topped with tomato sauce and cheese and baked in an oven.

Upvotes: 0

Petr

Reputation: 6269

The short descriptions you're looking at are Wikidata descriptions. They are also available as a description property in the REST API summary endpoint response, along with a more verbose extract, the image for the page and a bunch of other info.
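As a rough illustration of the summary endpoint mentioned above, the sketch below builds the request URL and pulls a few fields out of the JSON response. The helper names and the User-Agent value are my own placeholders, and only a handful of the response fields are shown.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Placeholder value; replace it with something that identifies your own tool.
USER_AGENT = "summary-endpoint-example/0.1 (you@example.com)"


def summary_url(title: str, lang: str = "en") -> str:
    """Build the REST API page summary URL for an article title."""
    slug = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{slug}"


def pick_summary_fields(payload: dict) -> dict:
    """Keep the short description, the longer extract and the image URL."""
    return {
        "title": payload.get("title"),
        "description": payload.get("description"),
        "extract": payload.get("extract"),
        "thumbnail": (payload.get("thumbnail") or {}).get("source"),
    }


def fetch_summary(title: str, lang: str = "en") -> dict:
    """Perform the HTTP request; needs network access."""
    request = Request(summary_url(title, lang), headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return pick_summary_fields(json.load(response))
```

For example, fetch_summary("Pizza")["description"] would give the one-line description and fetch_summary("Pizza")["extract"] the longer text.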

Upvotes: 0

Javad

Reputation: 313

I found the answer to my question. Those summaries come from Wikidata, which is a sister project of Wikipedia. According to Wikidata's Wikipedia page:

Wikidata is a collaboratively edited knowledge base operated by the Wikimedia Foundation. It is intended to provide a common source of data which can be used by Wikimedia projects such as Wikipedia, and by anyone else, under a public domain license.

For instance, the Wikidata page for Pizza is https://www.wikidata.org/wiki/Q177. Wikidata has its own API, which is described at https://www.wikidata.org/w/api.php.
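To make this concrete, here is a small sketch that asks the Wikidata API for an item's description using the wbgetentities action. The function names and the User-Agent value are placeholders I chose for the example; Q177 is the Pizza item mentioned above.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder value; replace it with something that identifies your own tool.
USER_AGENT = "wikidata-description-example/0.1 (you@example.com)"


def entity_description_url(item_id: str, lang: str = "en") -> str:
    """Build a wbgetentities URL that requests only the item's descriptions."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "descriptions",
        "languages": lang,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def parse_entity_description(payload: dict, item_id: str, lang: str = "en") -> Optional[str]:
    """Return the description text in the given language, or None."""
    descriptions = payload.get("entities", {}).get(item_id, {}).get("descriptions", {})
    entry = descriptions.get(lang)
    return entry["value"] if entry else None


def fetch_entity_description(item_id: str, lang: str = "en") -> Optional[str]:
    """Perform the HTTP request; needs network access."""
    request = Request(entity_description_url(item_id, lang), headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return parse_entity_description(json.load(response), item_id, lang)
```

fetch_entity_description("Q177") should then return the English description of the Pizza item, and passing another language code returns that language's description if one exists.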

Upvotes: 0

Paco

Reputation: 666

For most Wikipedia entries, there is a corresponding page on DBpedia. For this example with pizza, see http://dbpedia.org/page/Pizza

DBpedia also has the benefit of being programmatically accessible, and most of its pages include the summaries.
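A sketch of what the programmatic access could look like, assuming the public SPARQL endpoint at https://dbpedia.org/sparql and assuming the summary is exposed as the dbo:abstract property of the resource. The helper names are made up for this example, and the resource name is inserted into the query without escaping, so only pass trusted values.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def abstract_query(resource: str, lang: str = "en") -> str:
    """Build a SPARQL query for a resource's abstract in one language."""
    return (
        "SELECT ?abstract WHERE { "
        f"<http://dbpedia.org/resource/{resource}> "
        "<http://dbpedia.org/ontology/abstract> ?abstract . "
        f'FILTER (lang(?abstract) = "{lang}") '
        "} LIMIT 1"
    )


def sparql_url(query: str) -> str:
    """Build the request URL, asking for SPARQL JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return "https://dbpedia.org/sparql?" + urlencode(params)


def parse_abstract(payload: dict) -> Optional[str]:
    """Return the first abstract from a SPARQL JSON result set, or None."""
    bindings = payload.get("results", {}).get("bindings", [])
    return bindings[0]["abstract"]["value"] if bindings else None


def fetch_abstract(resource: str, lang: str = "en") -> Optional[str]:
    """Perform the HTTP request; needs network access."""
    request = Request(sparql_url(abstract_query(resource, lang)))
    with urlopen(request, timeout=10) as response:
        return parse_abstract(json.load(response))
```

With this, fetch_abstract("Pizza") would return the English abstract if DBpedia has one. Note that this is a longer article abstract, not the one-line description Wikipedia shows in search.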

Upvotes: 0
