Reputation: 93
I am trying to extract all the textual content from a Wikipedia page, including the tables, using the API sandbox for the Wikipedia page on Ballon_d'Or.
I tried the following query:
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Ballon_d%27Or&explaintext=1&exsectionformat=wiki
but it returns only the prose, without the content of the wiki tables.
Is there a way to obtain the table content in a textual format along with the text I am already getting?
Alternatively, I could scrape the page with Beautiful Soup, but I wanted to look for a query-based method first.
Upvotes: 2
Views: 1731
Reputation: 7036
Use action=parse instead of action=query:
https://en.wikipedia.org/w/api.php?action=parse&page=Ballon_d'Or&prop=text
By adding &section=2 you will access only the second section, Winners.
This may also help you later: Regular expression to remove HTML tags
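As a rough illustration of the approach above, here is a minimal Python sketch that builds the action=parse URL and pulls the cell text out of the returned HTML. It uses the standard library's html.parser rather than a regular expression to strip the tags. The helper names (build_parse_url, table_rows, fetch_html), the User-Agent string, and the choice of formatversion=2 (so that the HTML arrives as a plain string under parse.text) are assumptions made for this example, not part of the original answer. Nested tables and style blocks inside cells are not handled.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

API = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, section=None):
    """Build an action=parse URL; `section` is optional (hypothetical helper)."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # assumption: HTML is then a plain string
    }
    if section is not None:
        params["section"] = str(section)
    return API + "?" + urllib.parse.urlencode(params)


class TableText(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html):
    """Return table rows as lists of cell strings."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(page, section=None):
    """Fetch the rendered HTML of a page or section (performs a network call)."""
    req = urllib.request.Request(
        build_parse_url(page, section),
        headers={"User-Agent": "table-text-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["parse"]["text"]
```

Calling table_rows(fetch_html("Ballon_d'Or", section=2)) should then give the rows of that section as lists of strings, which can be joined with tabs or commas and appended to the plain text obtained from the extracts query.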
Upvotes: 1