horoyoi o

Reputation: 682

Fetching text from Wikipedia’s Infobox in Python

I want to get the infobox contents of https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie

I followed this article.

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  

store = etree.fromstring(req.text) 

# this will give Motto portion of above  
# URL's info box of Wikipedia's page 
output = store.xpath('//table[@class="infobox vcard"]/tr[th/text()="Destinations"]/td/i')  

# printing the text portion 
print output[0].text   

But the result is empty.

Even though req.text contains the page, the XPath query returns nothing. How can I get the contents of this infobox? In particular:

IATA ICAO
AH DAH

I need the IATA and ICAO codes. Please help.

I don't want to use DBpedia: it is not synchronized in real time with Wikipedia, so there can be a delay of a few months between the Wikipedia version and the corresponding DBpedia entry.

Upvotes: 0

Views: 561

Answers (1)

furas

Reputation: 142691

To get AH, DAH, AIR ALGERIE you can use

xpath( '//td[@class="nickname"]' ) 

As for your XPath: in this HTML there is a <tbody> between <table> and <tr>, so you have to include it in the XPath

'//table[@class="infobox vcard"]/tbody/tr[th/text()="Destinations"]/td'

or use //, which works even if there are more tags between <table> and <tr>

'//table[@class="infobox vcard"]//tr[th/text()="Destinations"]/td'

I also skipped the <i> at the end because the "Destinations" row doesn't use <i>


import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  
store = etree.fromstring(req.text) 

output = store.xpath('//td[@class="nickname"]')  
for x in output:
    print(x.text.strip())

#output = store.xpath('//table[@class="infobox vcard"]//tr[th/text()="Destinations"]/td')
output = store.xpath('//table[@class="infobox vcard"]/tbody/tr[th/text()="Destinations"]/td')
print(output[0].text) 

Result

AH
DAH
AIR ALGERIE
69
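
The effect of the extra <tbody> can be reproduced offline. This is only a minimal sketch: the markup is a hypothetical, trimmed-down stand-in for the real infobox, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without a network request or extra packages. ElementTree supports only a subset of XPath and needs a leading "." on the path; with lxml the same three expressions would be passed to store.xpath() starting with "//".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the infobox markup (not the real page).
SAMPLE = """<html><body>
<table class="infobox vcard">
  <tbody>
    <tr><th>Destinations</th><td>69</td></tr>
  </tbody>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Direct child step: finds nothing, because <tr> is a child of <tbody>,
# not of <table>.
direct = root.findall('.//table[@class="infobox vcard"]/tr[th="Destinations"]/td')

# Naming <tbody> explicitly reaches the row.
via_tbody = root.findall('.//table[@class="infobox vcard"]/tbody/tr[th="Destinations"]/td')

# The descendant step // reaches the row at any depth below <table>.
any_depth = root.findall('.//table[@class="infobox vcard"]//tr[th="Destinations"]/td')

print(len(direct))
print(via_tbody[0].text)
print(any_depth[0].text)
```

The first query returning an empty list is the same situation as in the question: output is empty, so output[0] has nothing to return.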

EDIT:

I use another XPath to get the names "IATA", "ICAO", "Callsign", and then I use zip() to pair them with "AH", "DAH", "AIR ALGERIE"

import requests 
from lxml import etree 

url='https://en.wikipedia.org/wiki/Air_Alg%C3%A9rie'

req = requests.get(url)  
store = etree.fromstring(req.text) 

keys = store.xpath('//table[@class="infobox vcard"]//table//tr[1]//a')
#for x in keys:
#    print(x.text.strip())

values = store.xpath('//td[@class="nickname"]')  
#for x in values:
#    print(x.text.strip())

some_dict = dict()

for k, v in zip(keys, values):
    k = k.text.strip()
    v = v.text.strip()
    some_dict[k] = v
    print(k, '=', v)

print(some_dict)

Result:

IATA = AH
ICAO = DAH
Callsign = AIR ALGERIE

{'IATA': 'AH', 'ICAO': 'DAH', 'Callsign': 'AIR ALGERIE'}
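
The pairing step on its own can also be written without the loop. A small sketch, assuming the key and value texts have already been extracted and stripped; the two lists here are hard-coded stand-ins for what the XPath queries return.

```python
# Hard-coded stand-ins for the stripped .text of the matched elements.
keys = ["IATA", "ICAO", "Callsign"]
values = ["AH", "DAH", "AIR ALGERIE"]

# zip() pairs items by position; dict() turns the pairs into a mapping.
codes = dict(zip(keys, values))

print(codes["IATA"], codes["ICAO"])
```

Note that zip() stops at the shorter input, so if the two XPath queries ever return different numbers of elements, the extra items are dropped silently. Comparing len(keys) and len(values) first is a cheap guard.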

Upvotes: 1
