Reputation: 409
I'm trying to remove some empty text tags from an XML file returned by a Python function, but I get this error: TypeError: object of type 'lxml.etree._ElementTree' has no len(). Why?
This is the function:
def due(pdfpath):
    ntree = uniform_cm(pdfpath)
    etree.strip_tags(ntree, 'textline')
    # Search for all "textbox" elements
    for textbox in ntree.xpath('//textbox'):
        new_line = etree.Element("new_line")
        previous_bb = None
        # From a given textbox element, iterate over all the "text" elements
        for x in textbox.iter("text"):
            # Get current bb value
            bb = getBBoxFirstValue(x)
            # Check current and past values aren't empty
            if bb is not None and previous_bb is not None and (bb - previous_bb) > 20:
                # Insert new_line into parent tag
                x.getparent().insert(x.getparent().index(x), new_line)
                # A new "new_line" element is created
                new_line = etree.Element("new_line")
            # Append current element to new_line tag
            new_line.append(x)
            # Keep latest non-empty BBox 1st value
            if bb is not None:
                previous_bb = bb
        # Add last new_line element if not null
        textbox.append(new_line)
    tree = ntree
    soup = BeautifulSoup(tree, "lxml")
    for x in soup.find_all():
        if len(x.get_text(strip=True)) == 0:
            x.extract()
    return tree
Upvotes: 1
Views: 1297
Reputation: 30971
The only case of len in your code sample is:
if len(x.get_text(strip=True)) == 0:
But I checked type(x) and got bs4.element.Tag, whereas your error message says 'lxml.etree._ElementTree' has no len().
So apparently the error occurred somewhere else.
A piece of advice for the future: when you look for the cause of an exception, state precisely on which line it occurred. The stack trace indicates this.
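As a rough illustration of reading the location out of a stack trace, here is a minimal sketch using only the standard traceback module. The Document class and the count_children function are hypothetical stand-ins invented for this example; they are not part of lxml or of the code in the question.

```python
import traceback


class Document:
    """Hypothetical stand-in for an object that does not define __len__."""


def count_children(doc):
    # len() raises TypeError here because Document has no __len__
    return len(doc)


try:
    count_children(Document())
except TypeError as exc:
    # The last traceback entry is the frame in which the exception was raised
    last_frame = traceback.extract_tb(exc.__traceback__)[-1]
    print("Raised in function:", last_frame.name)
    print("Line number:", last_frame.lineno)
    print("Message:", exc)
```

The same information appears at the bottom of the traceback that Python prints by default, so quoting that traceback in full is usually enough.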
So I did some investigation independent of your code sample.
When you parse an XML file using lxml, e.g.:
from lxml import etree as et
tree = et.parse('Input.xml')
the type of tree (the whole XML document) is just lxml.etree._ElementTree.
If you now attempt to run len(tree), you will get:
TypeError: object of type 'lxml.etree._ElementTree' has no len()
But when you read the root element from this tree with root = tree.getroot(), the type of root is lxml.etree._Element (note that now you have an element, not the whole document), and you can run len(root), getting the number of its direct children. The same holds for any other element in the XML tree.
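The difference can be reproduced with a small self-contained sketch. It assumes lxml is installed and parses an in-memory document (the XML content is made up for illustration) instead of Input.xml:

```python
import io

from lxml import etree as et

# Made-up sample document with three direct children under the root
xml_bytes = b"<root><a/><b/><c>text</c></root>"

# et.parse returns the whole document as an _ElementTree
tree = et.parse(io.BytesIO(xml_bytes))
print(type(tree).__name__)

try:
    len(tree)
    tree_has_len = True
except TypeError as exc:
    # The document object does not support len()
    tree_has_len = False
    print(exc)

# The root is an _Element, and len() counts its direct children
root = tree.getroot()
child_count = len(root)
print(type(root).__name__, child_count)
```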
Note also the following inconsistency in lxml:
When you read XML content from a string, i.e.: root = et.XML(some_text_variable)
the result is the root element, not the document tree.
And now you can call len(root).
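A short sketch of this difference, again assuming lxml is installed and using made-up XML content. It also shows getroottree(), which returns the document object when you start from an element:

```python
from lxml import etree as et

# et.XML parses a string and returns the root element directly
root = et.XML("<root><a/><b/></root>")
root_type = type(root).__name__
root_children = len(root)

# To get the document object from an element, ask for its root tree
tree = root.getroottree()
tree_type = type(tree).__name__

print(root_type, root_children)
print(tree_type, tree.getroot().tag)
```

So when a function may receive either kind of object, calling getroot() on the document first is one way to make sure len() is applied to an element.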
Upvotes: 2