Ohsik

Reputation: 261

Wikipedia API call for specific content on the page

How can I make a Wikipedia API call to get the name, location, and country of the top 5 airports on this page?

http://en.wikipedia.org/wiki/List_of_the_world%27s_busiest_airports_by_passenger_traffic

Upvotes: 1

Views: 213

Answers (1)

Ruben Marrero

Reputation: 1392

Here you can see all the JSON you need, pretty-printed:

http://en.wikipedia.org/w/api.php?format=jsonfm&action=query&titles=List_of_the_world's_busiest_airports_by_passenger_traffic&prop=revisions&rvprop=content

Change format=jsonfm to format=json and you get just the raw data, without the HTML formatting.
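As a rough sketch of that switch, the request URL can be assembled by a small shell function that takes the format as its only argument. The function name wiki_url is hypothetical; the endpoint and parameters are the ones from the URL above. Quoting the whole string also avoids having to backslash-escape & and ' on the command line.

```shell
# Hypothetical helper: builds the query URL for a given output format.
wiki_url() {
    fmt=$1
    title="List_of_the_world's_busiest_airports_by_passenger_traffic"
    printf '%s\n' "http://en.wikipedia.org/w/api.php?format=${fmt}&action=query&titles=${title}&prop=revisions&rvprop=content"
}

wiki_url jsonfm   # pretty-printed, for reading in a browser
wiki_url json     # raw JSON, for piping into other tools
```

The second URL is the one to hand to curl, for example curl "$(wiki_url json)".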

Solution:

On Linux, this command returns every row of the lists on the page:

curl http://en.wikipedia.org/w/api.php?format=json\&action=query\&titles=List_of_the_world\'s_busiest_airports_by_passenger_traffic\&prop=revisions\&rvprop=content | sed 's|\\u||g' | grep -onE '\\n\|[0-9]+\.\|\|[^\\]*'

Each line of the output is one airport, in rank order (30 or 50 airports per list, depending on the list).

And this command prints only the airport names:

curl http://en.wikipedia.org/w/api.php?format=json\&action=query\&titles=List_of_the_world\'s_busiest_airports_by_passenger_traffic\&prop=revisions\&rvprop=content | sed 's|\\u||g' | grep -onE '\\n\|[0-9]+\.\|\|[^\\]*' | grep -onE '} \[\[[^[\]*]' | sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //' 

Note: all the lists on the page are concatenated, so the last line is not really rank 600. The first 30 lines carry their real ranks; after that, each block of 30 or 50 lines (depending on the list) belongs to a different list.
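Since the rows of the first list come out first, the top 5 asked for in the question would be the first five lines of that output, assuming the first list on the page is the one of interest. A minimal sketch, with made-up placeholder names standing in for the real output:

```shell
# Placeholder names standing in for the pipeline's output (not real data).
names='Alpha Airport
Beta Airport
Gamma Airport
Delta Airport
Epsilon Airport
Zeta Airport
Eta Airport'

# Keep only the first five lines, i.e. ranks 1 to 5 of the first list.
top5=$(printf '%s\n' "$names" | head -n 5)
printf '%s\n' "$top5"
```

In practice the same head -n 5 would be appended to the end of the names command above.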

Explanation:

I got the URL endpoint from here and then used curl to send a GET request to Wikipedia's API, which returns all the available data for the page you requested. Then I use regular expressions to parse out the values I need. The regular expressions are:

sed 's|\\u||g' 

This one is run by sed (the stream editor). It finds every occurrence of \u (the prefix of a JSON Unicode escape such as \u00e9) and removes it. I need that because later I use the string '\n' (the JSON-escaped newline) as the separator between rows, so no other backslashes should get in the way. The s command of sed substitutes every occurrence of the string \u with nothing; there are two backslashes because the backslash has to be escaped, otherwise it would be interpreted as part of the command.
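To see the effect in isolation, here is a small sketch that runs the same sed expression on a made-up string (not real API output) containing one \u escape and one \n escape:

```shell
# Made-up sample: JSON-escaped text with a \u escape and a \n escape.
sample='Caf\u00e9 terminal\n|1.||next row'

# Delete every literal "\u"; the \n escape is left alone.
cleaned=$(printf '%s\n' "$sample" | sed 's|\\u||g')
printf '%s\n' "$cleaned"
# prints: Caf00e9 terminal\n|1.||next row
```

The hex digits of the escape stay behind as plain text, which is harmless here because only the backslashes matter to the next step.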

grep -onE '\\n\|[0-9]+\.\|\|[^\\]*'

This regular expression is run by grep. First (as mentioned above) we match the escaped newline \n; again, the backslash has to be escaped. Then we match the character |, which must be escaped too. Then we match one or more digits with [0-9]+: everything inside [] is a set of characters, 0-9 is the range we want, and + means one or more. Next comes the character ., which also needs escaping, followed by | twice. At this point we have matched the rank, and now we want every character up to the end of the row, which is the next '\n'. Since we already deleted the useless \u sequences, the backslashes that are left are the ones from new lines, so the set we need is [\\], negated by putting ^ in front of the backslashes; the * then matches zero or more characters that aren't backslashes. The -onE in front of the regular expression are the options passed to grep: o = print only the matched text, n = prefix each output line with its input line number, and E = extended regular expressions.
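A quick way to check this pattern is to run it on a small made-up string in the shape of the JSON-escaped wikitext. The sample below is an assumption for illustration, not real API output, and -n is left out so that the "1:" prefix does not clutter the result:

```shell
# Made-up sample in the shape of the API's JSON-escaped wikitext (not real data).
sample='intro\n|-\n|1.||{{flagicon|USA}} [[Alpha Airport]]||Atlanta\n|-\n|2.||{{flagicon|CHN}} [[Beta Airport]]||[[Beijing]]\n|}'

# -o prints each match on its own line; each match stops at the next backslash.
rows=$(printf '%s\n' "$sample" | grep -oE '\\n\|[0-9]+\.\|\|[^\\]*')
printf '%s\n' "$rows"
# prints:
# \n|1.||{{flagicon|USA}} [[Alpha Airport]]||Atlanta
# \n|2.||{{flagicon|CHN}} [[Beta Airport]]||[[Beijing]]
```

The separator rows (\n|-) and the table end (\n|}) are skipped because no digit follows the pipe.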

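grep -onE '} \[\[[^[\]*]'

At this point we have all the rows, each with all its available data, and we want just the names, which are enclosed within [[...]] and always come after a }. This works like the previous pattern, except that the characters excluded this time are [ and \ and the match has to end with a ]. Because * is greedy, the match normally runs up to the closing ]] of the name and stops before any later [[...]] link in the row.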
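As a sketch of this step on its own, the name pattern from the full command above can be tried on two made-up rows (placeholders, not real data). Again -n is omitted to keep the output free of line-number prefixes:

```shell
# Made-up rows in the shape produced by the previous grep (not real data).
rows='\n|1.||{{flagicon|USA}} [[Alpha Airport]]||Atlanta
\n|2.||{{flagicon|CHN}} [[Beta Airport]]||[[Beijing]]'

# Match "} [[", then anything that is neither "[" nor "\", ending at a "]".
links=$(printf '%s\n' "$rows" | grep -oE '} \[\[[^[\]*]')
printf '%s\n' "$links"
# prints:
# } [[Alpha Airport]]
# } [[Beta Airport]]
```

The linked location [[Beijing]] in the second row is not picked up, since it is not preceded by a closing brace.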

sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //'

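The only thing this sed command does is strip the leftover markup. The first expression deletes every character listed inside [] (backslash, [, |, : and }), the second removes the closing ]], the third removes the digits (the line numbers that grep -n prepends), and the last removes the first remaining space, which is the one in front of the name. Maybe it isn't the most efficient way to do it, but it works.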
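The cleanup can be checked the same way. The input lines below are made up to look like the output of grep -on, including the "1:" prefix; the airport names are placeholders:

```shell
# Made-up lines in the shape of the grep -on output (not real data).
links='1:} [[Alpha Airport]]
1:} [[Beta Airport]]'

# Strip brackets, pipes, colons, braces and digits, then the leading space.
clean=$(printf '%s\n' "$links" | sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //')
printf '%s\n' "$clean"
# prints:
# Alpha Airport
# Beta Airport
```

Note that a name containing digits or a colon would lose those characters too, since the expressions do not distinguish markup from the name itself.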

Important: I have just noticed that the spacing within the JSON is not consistent, so I had to tweak the regular expression a bit more. I won't change the explanation above, since I only added a ? wherever a whitespace may or may not appear.

curl http://en.wikipedia.org/w/api.php?format=json\&action=query\&titles=List_of_the_world\'s_busiest_airports_by_passenger_traffic\&prop=revisions\&rvprop=content | sed 's|\\u||g' | grep -E '\\n\|[0-9]+\.\|\|[^\\]*'  | grep -onE '} ?\[\[[^[\]*]' | sed 's/[\[|:}]//g; s/]]//; s/[0-9][0-9]*//g; s/ //'

And here you have the output on pastebin.

Further reading: this link will help you use regular expressions with JavaScript.

No need for curl: you can test what any request outputs here.

Upvotes: 1
