mchlsctt

Reputation: 115

How to use lxml to find element text in XHTML document

I've been banging my head against this for ages, so I must be doing something stupid.

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias.

Here is my python code so far, which is simply trying to retrieve one of the tables:

import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()

On my machine this only prints an empty list. To increase speed I cached the page locally and used:

wikipage = open("wikipage.html")
root = etree.parse(wikipage)

but this makes no impact whatsoever (other than the obvious speedup). I have also tried

root.find('table')

and:

for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

which successfully prints out all of the elements, so I know the tree is being created.

What am I doing wrong?

Any help would be appreciated. Thanks.

Upvotes: 5

Views: 5216

Answers (3)

jfs

Reputation: 414069

Parse it as html.

from lxml import html

url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias'
tree = html.parse(url)
languages = tree.xpath('//table/tr/td[2]/a/text()')
print('\n'.join(languages))

Output

English
German
French
Polish
Italian
Japanese
Spanish
Portuguese
Dutch
Russian
Swedish
Chinese
Catalan
Norwegian (Bokmål)
Finnish
Ukrainian
Czech
Hungarian
Romanian
Korean
Turkish
Vietnamese
Indonesian
Danish
Arabic
Esperanto
Serbian
Lithuanian
Slovak
Volapük
Persian
Hebrew
Bulgarian
Slovenian
Malay
Waray-Waray
Croatian
Estonian
Newar / Nepal Bhasa
Simple English
Hindi
Galician
Thai
Basque
Norwegian (Nynorsk)
Aromanian
Greek
Haitian
Azerbaijani
Tagalog
Latin
Telugu
Georgian
Macedonian
Cebuano
Serbo-Croatian
Breton
Piedmontese
Marathi
Latvian
Luxembourgish
Javanese
Belarusian (Taraškievica)
Welsh
Icelandic
Bosnian
Albanian
Tamil
Belarusian
Bishnupriya Manipuri
Aragonese
Occitan
Bengali
Swahili
Ido
Lombard
West Frisian
Gujarati
Afrikaans
Low Saxon
Malayalam
Quechua
Sicilian
Urdu
Kurdish
Cantonese
Sundanese
Asturian
Neapolitan
Samogitian
Armenian
Yoruba
Irish
Chuvash
Walloon
Nepali
Ripuarian
Western Panjabi
Kannada
Tajik
Tarantino
Venetian
Yiddish
Scottish Gaelic
Tatar
Min Nan
Ossetian
Uzbek
Alemannic
Kapampangan
Sakha
Kazakh
Egyptian Arabic
Maori
Amharic
Limburgian
Nahuatl
Upper Sorbian
Gilaki
Corsican
Gan
Mongolian
Scots
Interlingua
Central_Bicolano
Burmese
Faroese
Võro
Dutch Low Saxon
Sinhalese
Turkmen
West Flemish
Sanskrit
Bavarian
Malagasy
Manx
Ilokano
Divehi
Norman
Pangasinan
Banyumasan
Sorani
Romansh
Northern Sami
Zazaki
Mazandarani
Wu
Friulian
Uyghur
Ligurian
Maltese
Bihari
Novial
Tibetan
Anglo-Saxon
Kashubian
Sardinian
Classical Chinese
Fiji Hindi
Khmer
Ladino
Zamboanga Chavacano
Pali
Franco-Provençal/Arpitan
Pashto
Hakka
Cornish
Punjabi
Navajo
Silesian
Kalmyk
Pennsylvania German
Hawaiian
Saterland Frisian
Interlingue
Somali
Komi
Karachay-Balkar
Crimean Tatar
Tongan
Acehnese
Meadow Mari
Picard
Kinyarwanda
Erzya
Lingala
Extremaduran
Guarani
Kirghiz
Emilian-Romagnol
Assyrian Neo-Aramaic
Papiamentu
Aymara
Chechen
Lojban
Wolof
Banjar
Bashkir
North Frisian
Greenlandic
Tok Pisin
Udmurt
Kabyle
Tahitian
Sranan
Zealandic
Hill Mari
Komi-Permyak
Lower Sorbian
Abkhazian
Gagauz
Igbo
Oriya
Lao
Kongo
Avar
Moksha
Mirandese
Romani
Old Church Slavonic
Karakalpak
Samoan
Moldovan
Tetum
Gothic
Kashmiri
Bambara
Inupiak
Sindhi
Bislama
Lak
Nauruan
Norfolk
Inuktitut
Pontic
Assamese
Cherokee
Min Dong
Palatinate German
Swati
Hausa
Ewe
Tigrinya
Oromo
Zulu
Zhuang
Venda
Tsonga
Kirundi
Cree
Dzongkha
Sango
Chamorro
Luganda
Buginese
Buryat (Russia)
Fijian
Chichewa
Akan
Sesotho
Xhosa
Fula
Tswana
Kikuyu
Tumbuka
Shona
Twi
Cheyenne
Ndonga
Sichuan Yi
Choctaw
Marshallese
Afar
Kuanyama
Hiri Motu
Muscogee
Kanuri
Herero
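To see why switching parsers helps without hitting the network, here is a minimal sketch that runs the same query through both parsers. The SAMPLE markup is a made-up miniature of the page (only the default XHTML namespace is taken from the real one), so treat it as an illustration rather than a reproduction.

```python
from lxml import etree, html

# Hypothetical miniature of the page: same default XHTML namespace, made-up rows.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    '<tr><th>No</th><th>Language</th></tr>'
    '<tr><td>1</td><td><a href="#en">English</a></td></tr>'
    '<tr><td>2</td><td><a href="#de">German</a></td></tr>'
    '</table></body></html>'
)

query = '//table/tr/td[2]/a/text()'

# XML parser: every element lives in the XHTML namespace, so the
# unprefixed names in the query match nothing.
as_xml = etree.fromstring(SAMPLE).xpath(query)

# HTML parser: the xmlns declaration is not treated as a namespace,
# so plain element names match.
as_html = html.fromstring(SAMPLE).xpath(query)

print(as_xml)   # []
print(as_html)  # ['English', 'German']
```

The trade-off is that the HTML parser is lenient, so it will also accept markup that a strict XML parse would reject.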

Upvotes: 3

Dimitre Novatchev

Reputation: 243459

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias

Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is the most frequently asked question about XPath, and there are numerous good answers under the SO xpath tag. Just search for them.

Here is a complete solution:

Use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()

where you have registered the XHTML namespace ("http://www.w3.org/1999/xhtml") bound to the prefix "x".
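In lxml, one way to do that registration is the namespaces argument of xpath(). The following is a sketch under assumptions: SAMPLE is an invented stand-in for the real page, and the names SAMPLE and NS are mine.

```python
from lxml import etree

# Tiny stand-in for the real page (assumed structure, not the actual markup).
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><th>No</th><th>Language</th></tr>
      <tr><td>1</td><td><a href="#en">English</a></td></tr>
      <tr><td>2</td><td><a href="#de">German</a></td></tr>
    </table>
  </body>
</html>"""

root = etree.fromstring(SAMPLE)

# Bind the prefix "x" to the XHTML namespace for the queries below.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Without a prefix the name "table" means "table in no namespace".
unprefixed = root.xpath("//table")

# With the prefix bound, the expression from this answer selects the cells.
languages = root.xpath(
    "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()", namespaces=NS
)

print(unprefixed)  # []
print(languages)   # ['English', 'German']
```

The prefix is only a local label. Any name works as long as it maps to the same namespace URI.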

When I evaluated this XPath expression against the document obtained from http://s23.org/wikistats/wikipedias_html, I needed to add the following at the start of the document, because I was working locally and didn't have the XHTML DTD. You may not need these:

<!DOCTYPE html [
<!ENTITY uarr "&#8593;">
<!ENTITY darr "&#8595;">
<!ENTITY ccedil "&#231;">
<!ENTITY oslash "&#248;">
<!ENTITY aacute "&#225;">
<!ENTITY aring "&#229;">
<!ENTITY agrave "&#224;">
<!ENTITY egrave "&#232;">
<!ENTITY ograve "&#242;">
<!ENTITY ocirc "&#244;">
]>

The result of applying the above XPath expression to this document is:

                    English

                    German

                    French

                    Polish

                    Italian

                    Japanese

                    Spanish

                    Portuguese

                    Dutch

                    Russian

                    Swedish

                    Chinese

                    Catalan

                    Norwegian (Bokmål)

                    Finnish

                    Ukrainian

                    Czech

                    Hungarian

                    Romanian

                    Korean

                    Turkish

                    Vietnamese

                    Indonesian

                    Danish

                    Arabic

                    Esperanto

                    Serbian

                    Lithuanian

                    Slovak

                    Volapük

                    Persian

                    Hebrew

                    Bulgarian

                    Slovenian

                    Malay

                    Waray-Waray

                    Croatian

                    Estonian

                    Newar / Nepal Bhasa

                    Simple English

                    Hindi

                    Galician

                    Thai

                    Basque

                    Norwegian (Nynorsk)

                    Aromanian

                    Greek

                    Haitian

                    Azerbaijani

                    Tagalog

                    Latin

                    Telugu

                    Georgian

                    Macedonian

                    Cebuano

                    Serbo-Croatian

                    Breton

                    Piedmontese

                    Marathi

                    Latvian

                    Luxembourgish

                    Javanese

                    Belarusian (Taraškievica)

                    Welsh

                    Icelandic

                    Bosnian

                    Albanian

                    Tamil

                    Belarusian

                    Bishnupriya Manipuri

                    Aragonese

                    Occitan

                    Bengali

                    Swahili

                    Ido

                    Lombard

                    West Frisian

                    Gujarati

                    Afrikaans

                    Low Saxon

                    Malayalam

                    Quechua

                    Sicilian

                    Urdu

                    Kurdish

                    Cantonese

                    Sundanese

                    Asturian

                    Neapolitan

                    Samogitian

                    Armenian

                    Yoruba

                    Irish

                    Chuvash

                    Walloon

                    Nepali

                    Ripuarian

                    Western Panjabi

                    Kannada

                    Tajik

                    Tarantino

                    Venetian

                    Yiddish

                    Scottish Gaelic

                    Tatar

                    Min Nan

                    Ossetian

                    Uzbek

                    Alemannic

                    Kapampangan

                    Sakha

                    Egyptian Arabic

                    Kazakh

                    Maori

                    Limburgian

                    Amharic

                    Nahuatl

                    Upper Sorbian

                    Gilaki

                    Corsican

                    Gan

                    Mongolian

                    Scots

                    Interlingua

                    Central_Bicolano

                    Burmese

                    Faroese

                    Võro

                    Dutch Low Saxon

                    Sinhalese

                    Turkmen

                    West Flemish

                    Sanskrit

                    Bavarian

                    Malagasy

                    Manx

                    Ilokano

                    Divehi

                    Norman

                    Pangasinan

                    Banyumasan

                    Sorani

                    Romansh

                    Northern Sami

                    Zazaki

                    Mazandarani

                    Wu

                    Friulian

                    Uyghur

                    Ligurian

                    Maltese

                    Bihari

                    Novial

                    Tibetan

                    Anglo-Saxon

                    Kashubian

                    Sardinian

                    Classical Chinese

                    Fiji Hindi

                    Khmer

                    Ladino

                    Zamboanga Chavacano

                    Pali

                    Franco-Provençal/Arpitan

                    Pashto

                    Hakka

                    Cornish

                    Punjabi

                    Navajo

                    Silesian

                    Kalmyk

                    Pennsylvania German

                    Hawaiian

                    Saterland Frisian

                    Interlingue

                    Somali

                    Komi

                    Karachay-Balkar

                    Crimean Tatar

                    Tongan

                    Acehnese

                    Meadow Mari

                    Picard

                    Erzya

                    Lingala

                    Kinyarwanda

                    Extremaduran

                    Guarani

                    Kirghiz

                    Emilian-Romagnol

                    Assyrian Neo-Aramaic

                    Papiamentu

                    Aymara

                    Chechen

                    Lojban

                    Wolof

                    Banjar

                    Bashkir

                    North Frisian

                    Greenlandic

                    Tok Pisin

                    Udmurt

                    Kabyle

                    Tahitian

                    Sranan

                    Zealandic

                    Hill Mari

                    Komi-Permyak

                    Lower Sorbian

                    Abkhazian

                    Gagauz

                    Igbo

                    Oriya

                    Lao

                    Kongo

                    Avar

                    Moksha

                    Mirandese

                    Romani

                    Old Church Slavonic

                    Karakalpak

                    Samoan

                    Moldovan

                    Tetum

                    Gothic

                    Kashmiri

                    Bambara

                    Inupiak

                    Sindhi

                    Bislama

                    Lak

                    Nauruan

                    Norfolk

                    Inuktitut

                    Pontic

                    Assamese

                    Cherokee

                    Min Dong

                    Swati

                    Palatinate German

                    Hausa

                    Ewe

                    Tigrinya

                    Oromo

                    Zulu

                    Zhuang

                    Venda

                    Tsonga

                    Kirundi

                    Dzongkha

                    Sango

                    Cree

                    Chamorro

                    Luganda

                    Buginese

                    Buryat (Russia)

                    Fijian

                    Chichewa

                    Akan

                    Sesotho

                    Xhosa

                    Fula

                    Tswana

                    Kikuyu

                    Tumbuka

                    Shona

                    Twi

                    Cheyenne

                    Ndonga

                    Sichuan Yi

                    Choctaw

                    Marshallese

                    Afar

                    Kuanyama

                    Hiri Motu

                    Muscogee

                    Kanuri

                    Herero

Do note: Every second selected node is a white-space-only text node. If you don't want these selected, use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
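A small sketch of the difference the predicate makes, assuming lxml and an invented one-row table whose cell is indented the way pretty-printed markup often is (SAMPLE, NS and base are illustrative names):

```python
from lxml import etree

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Invented sample; the whitespace around the link is deliberate.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
    <a href="#en">English</a>
</td></tr>
</table></body></html>"""

root = etree.fromstring(SAMPLE)
base = "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()"

# Every text node under the cell, including the indentation around the link.
raw = root.xpath(base, namespaces=NS)

# Only text nodes that still contain something after whitespace normalization.
cleaned = root.xpath(base + "[normalize-space()]", namespaces=NS)

print([str(t) for t in raw])      # ['\n    ', 'English', '\n']
print([str(t) for t in cleaned])  # ['English']
```

The predicate only drops whitespace-only nodes. It does not trim the nodes it keeps, so a str.strip() on each result may still be wanted.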

Upvotes: 3

peter.murray.rust

Reputation: 38033

XPath is namespace-aware, so an unprefixed name such as table matches only elements that are in no namespace. The page you have downloaded starts:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">

So you actually want

xpath('//html:table')

where html is the prefix bound to "http://www.w3.org/1999/xhtml"

You will have to find out how to bind namespaces in lxml, as I am not a Python expert.
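For reference, a hedged sketch of how the binding can look in lxml. xpath() takes a prefix-to-URI mapping, while find(), findall() and iter() expect the namespace written out in {uri}name form. The fragment below is made up, apart from the root declaration quoted above.

```python
from lxml import etree

XHTML = "http://www.w3.org/1999/xhtml"

# Made-up fragment with the same root declaration as the downloaded page.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">'
    '<body><table><tr><td>English</td></tr></table></body></html>'
)
root = etree.fromstring(SAMPLE)

# xpath(): pass a prefix-to-URI mapping; the prefix name itself is arbitrary.
tables = root.xpath('//html:table', namespaces={'html': XHTML})

# find()/findall()/iter(): spell the namespace out in {uri}name form.
same_tables = root.findall('.//{%s}table' % XHTML)

print(tables[0].tag)                  # {http://www.w3.org/1999/xhtml}table
print(len(tables), len(same_tables))  # 1 1
```

Printing element.tag, as the loop in the question does, shows the same {uri}name form, which is a quick way to check whether a document uses a default namespace.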

If this is your problem I sympathize - it has caught me and many others out!

Upvotes: 0
