catwiesel81

Reputation: 11

python lxml xpath AttributeError (NoneType) with correct xpath and usually working

I am trying to migrate a forum to phpBB3 with Python/XPath. Although I am fairly new to both, it is going well. However, I need help with an error.

(The source file has been downloaded and processed with tagsoup.)

Firefox/Firebug show xpath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b

(in my script without tbody)

Here is an abbreviated version of my code:

from lxml import etree

forumfile = "morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"
XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)

XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"

XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"

for p in allposts:
    unreg = 0
    username = None
    username = p.find(XUSER).text  # this is where it goes haywire

When the loop hits user "tompson" (position()=11, at the end of the file), I get

AttributeError: 'NoneType' object has no attribute 'text'

I've tried various try/except/else/finally combinations, but they weren't helpful.
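For what it's worth, a sketch of what such a guard could look like: instead of catching the AttributeError, check whether find() returned None before touching .text. The XPath and the username here mirror the question; the helper name and the demo rows are made up for illustration.

```python
from lxml import etree

def extract_text(post, path, default="unknown"):
    """Return the text of the first match, or a default when find() yields None."""
    node = post.find(path)
    return node.text if node is not None else default

# Minimal demo rows: one with the expected <b> node, one without it.
row_ok = etree.fromstring("<tr><td><a/><a/><a><b>tompson</b></a></td></tr>")
row_bad = etree.fromstring("<tr><td><a/></td></tr>")

print(extract_text(row_ok, "td[1]/a[3]/b"))   # tompson
print(extract_text(row_bad, "td[1]/a[3]/b"))  # unknown
```

This only papers over the symptom, though; it does not explain why the node is missing in the first place.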

I am extracting much more information later in the script, such as the date of the post, the date of user registration, the URL and attributes of the avatar, and the content of the post...

The script works for hundreds of other files/sites of this forum.

This is not an encode/decode problem, and it is not limited to the XUSER part. If I hardcode the username, the date of registration fails instead. If I skip those, the text of the post (code below) fails...

# text of getpost
text = etree.tostring(p.find(XTEXT), pretty_print=True)

Now, this whole error would make sense if my XPath were wrong. However, all the other files work, and so do the first ten users in this file; it fails only at position()=11.

Is position() incapable of going above 10? I don't think so. Am I missing something?

Upvotes: 0

Views: 1821

Answers (1)

catwiesel81

Reputation: 11

Question answered!

I have found the answer...

I must have been very tired when I tried to fix it and came here to ask for help; I missed something quite obvious. The way I posted my problem, it was not visible here either.

  • The HTML I downloaded and processed with tagsoup had an additional tag at position 11. It was not visible on the website and broke my XPath. (It is probably crappy HTML generated by the forum, combined with tagsoup's attempt to make it parseable.) Out of more than 20,000 files, fewer than 20 are affected; this one just happened to be the first.

  • Additionally, the information is sometimes in table[4] and other times in table[5]. I had accounted for this with a function that determines the correct table. Although I tested that function a LOT and thought it worked correctly (hence did not include it above), it did not. So I wrote a better XPath:

    '/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
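To illustrate why the predicate works: it selects whichever table contains a td with width="20%", so the table's index no longer matters. The HTML below is an assumed minimal stand-in for the forum markup, not the real file.

```python
from lxml import etree

# Assumed minimal stand-in for the forum page: posts may live in any table,
# but only the post table has <td width="20%"> cells.
html = """
<html><body>
  <table><tr><td>navigation</td></tr></table>
  <table><tr><td width="20%">header</td></tr>
         <tr><td width="20%">tompson</td></tr></table>
</body></html>"""

t = etree.fromstring(html)
rows = t.xpath('/html/body/table[tr/td[@width="20%"]]/tr[position()>1]')
print(len(rows))  # 1 -- only the row after the header, in the matching table
```

The same expression matches whether the post table happens to be table[4] or table[5].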

And, although this is not related, I ran into another problem with unexpected encoding in the HTML file (not UTF-8), which was fixed by adding:

parser = etree.XMLParser(encoding='ISO-8859-15')  
t = etree.parse(forumfile, parser)

I am now confident that, after adjusting for the strange additional and duplicated tags, my code will work on all files...

Still, I will be looking into lxml.html. As I mentioned in the comment, I have never used it before, but if it is more robust and can parse the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to handle the few files that break my current script...
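A quick sketch of that idea, untested against the real forum files: lxml.html wraps libxml2's lenient HTML parser, so broken markup can often be parsed directly, without a separate tagsoup pass. The snippet and its deliberately broken input are made up for illustration.

```python
import lxml.html

# Deliberately broken markup: the <b> tag is never closed.
broken = "<html><body><table><tr><td><b>tompson</td></table>"
doc = lxml.html.fromstring(broken)

# The lenient parser recovers, so the text is still reachable.
usernames = doc.xpath("//td//text()")
print(usernames)
```

An encoding can be passed here too, via lxml.html.HTMLParser(encoding='ISO-8859-15'), mirroring the XMLParser fix above.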

Upvotes: 1
