How to get clean JSON from the Wikipedia API

Question

I want to get the content of a Wikipedia page, https://en.wikipedia.org/wiki/February_2, as JSON.

I tried using their API: https://en.wikipedia.org/w/api.php?action=parse&page=February_19&prop=text&formatversion=2&format=json

The response is in JSON format, but the content inside it is HTML. I want only the text content.

I need a way to get a clean result.

logi-kal · Accepted Answer

If you want plain text without markup, you first have to parse the JSON object and then extract the text from the HTML code:

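// Strip the markup by letting the browser parse the HTML into a detached element.
function htmlToText(html) {
  const tempDiv = document.createElement("div");
  tempDiv.innerHTML = html;
  return tempDiv.textContent || tempDiv.innerText || "";
}

// origin=* allows the cross-origin request; formatversion=2 returns the HTML as a plain string.
const url = 'https://en.wikipedia.org/w/api.php?action=parse&page=February_19&prop=text&format=json&formatversion=2&origin=*';

// Requires jQuery.
$.getJSON(url, function(data) {
  const html = data.parse.text;
  const plainText = htmlToText(html);
  // Keep only the lines that start with a four-digit year followed by an en dash.
  const array = [...plainText.matchAll(/^\d{4} *–.*/gm)].map(x => x[0]);
  console.log(array);
});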

Update: I edited the code above according to the comment below. Now the code extracts all the list items and puts them into an array.
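If the goal is structured JSON rather than an array of strings, each matched line can be split into a year and a description. The following is only a minimal sketch under a few assumptions: `extractEvents` is a hypothetical helper name (not part of the code above), the sample text is made up to stand in for the output of `htmlToText`, and every list item is assumed to follow the "year, en dash, text" pattern that the regular expression above already relies on.

```javascript
// Hypothetical helper (not part of the original answer): turns the plain text
// produced by htmlToText into an array of { year, event } objects.
function extractEvents(plainText) {
  // Same line pattern as above, with capture groups for the year and the text:
  // a four-digit year, optional spaces, an en dash, then the description.
  const pattern = /^(\d{4}) *– *(.*)$/gm;
  return [...plainText.matchAll(pattern)].map(([, year, event]) => ({
    year: Number(year),
    event: event.trim(),
  }));
}

// Made-up sample text standing in for the output of htmlToText.
const sample = [
  "Events",
  "1600 – First sample event.",
  "1915 – Second sample event.",
  "Births",
].join("\n");

// Lines that do not start with a year (the headings here) are skipped.
console.log(JSON.stringify(extractEvents(sample), null, 2));
```

Inside the `$.getJSON` callback this could be called as `extractEvents(plainText)` in place of the `matchAll` line. Lines that do not start with a four-digit year, such as section headings or entries from earlier centuries, would be skipped, so the pattern may need adjusting for other pages.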
