Harry Tom

Reputation: 424

How to get clean JSON from the Wikipedia API

I want to get the content of a Wikipedia page, https://en.wikipedia.org/wiki/February_2, as JSON.

I tried using their API: https://en.wikipedia.org/w/api.php?action=parse&page=February_19&prop=text&formatversion=2&format=json

Although the response is in JSON format, the content inside it is HTML. I want only the text content.

I need a way to get a clean result.

Upvotes: 0

Views: 2234

Answers (2)

logi-kal

Reputation: 7880

If you want plain text without markup, you first have to parse the JSON object and then extract the text from the HTML:

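// Convert an HTML string to plain text by letting the browser parse it.
function htmlToText(html) {
  const tempDiv = document.createElement("div");
  tempDiv.innerHTML = html;
  return tempDiv.textContent || tempDiv.innerText || "";
}

// origin=* allows anonymous cross-origin (CORS) requests from the browser.
const url = 'https://en.wikipedia.org/w/api.php?action=parse&page=February_19&prop=text&format=json&formatversion=2&origin=*';

$.getJSON(url, function(data) {
  // With formatversion=2, parse.text is the page HTML as a plain string.
  const html = data['parse']['text'];
  const plainText = htmlToText(html);
  // Keep only the lines that start with a four-digit year followed by an en dash.
  const array = [...plainText.matchAll(/^\d{4} *–.*/gm)].map(x=>x[0]);
  console.log(array);
});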
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

Update: I edited the code above according to the comment below. The function now extracts all the list items and puts them into an array.
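To see what the regular expression step does in isolation, here is a minimal sketch that runs the same pattern on a made-up string. The sample lines and the helper name extractYearEntries are hypothetical, not taken from the real page:

```javascript
// Hypothetical sample text standing in for the output of htmlToText().
// The entries are invented for illustration only.
const sampleText = [
  "Events",
  "1600 – First sample event.",
  "1807 – Second sample event.",
  "Births",
  "1473 – Sample person, sample occupation",
].join("\n");

// Hypothetical helper wrapping the same regex as in the answer above.
// The "m" flag makes ^ match at the start of every line, and the "g" flag
// is required by matchAll().
function extractYearEntries(plainText) {
  return [...plainText.matchAll(/^\d{4} *–.*/gm)].map(m => m[0]);
}

console.log(extractYearEntries(sampleText));
```

Section headings such as "Events" and "Births" are dropped because they do not start with a year. Note that the pattern expects exactly four digits before the en dash, so an entry whose year has fewer digits would be skipped as well.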

Upvotes: 1

AXO

Reputation: 9086

I guess by clean you mean the source wikitext. In that case you can use the revisions module:

https://en.wikipedia.org/w/api.php?action=query&titles=February_2&prop=revisions&rvprop=content&formatversion=2&format=json

See API:Get the contents of a page and API:Revisions for more info.
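As a rough sketch of how the wikitext could be pulled out of that response in JavaScript: the helper names extractWikitext and fetchWikitext are hypothetical, and the code assumes the response shape returned with formatversion=2 when rvslots is not specified, where the wikitext sits under query.pages[0].revisions[0].content. If you add rvslots=main, the text should be under revisions[0].slots.main.content instead.

```javascript
// Hypothetical helper: pull the wikitext out of a parsed API response.
// Assumes formatversion=2 without rvslots, so pages is an array and the
// revision text is stored under the "content" key.
function extractWikitext(data) {
  const page = data?.query?.pages?.[0];
  if (!page || page.missing) {
    return null; // no such page, or an unexpected response shape
  }
  return page.revisions?.[0]?.content ?? null;
}

// Hypothetical helper: request the latest revision of a page and return its
// wikitext. origin=* allows anonymous cross-origin requests from a browser.
async function fetchWikitext(title) {
  const params = new URLSearchParams({
    action: "query",
    titles: title,
    prop: "revisions",
    rvprop: "content",
    formatversion: "2",
    format: "json",
    origin: "*",
  });
  const response = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractWikitext(await response.json());
}
```

Usage would look like fetchWikitext("February_2").then(console.log). The result is raw wikitext, so it still contains wiki markup such as [[links]] and {{templates}}.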

Upvotes: 0
