user3241376

Reputation: 417

lxml strip_tags results in AttributeError

I need to clean an HTML file, e.g. delete redundant 'span' tags. A 'span' is considered redundant if it has the same font-weight and font-style as its parent node in the CSS file (which I converted to a dictionary for faster lookup).

The html file looks like this:

<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>

The CSS styling, which I have already stored in a dictionary:

{'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique', 
 'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic', 
 'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic', 
 'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal', 
 'Title': 'font-style: oblique; text-align: center; font-weight: bold', 
 'norm': 'font-style: normal; text-align: center; font-weight: normal'}

So, given that <p Title> and <span id xxxxx>, and <p norm> and <span bbbbbb>, have the same font-weight and font-style in the CSS dictionary, I want to get the following result:

<p class= "Title">blablabla bla prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss  aa  </p>

Also, there are spans that I can delete just by looking at their id: if it contains "af", I delete them without consulting the dictionary.

So, in my script there is:

from lxml import etree
from asteval import Interpreter

tree = etree.parse("filename.html")

aeval = Interpreter()
filedic = open('dic_file', 'rb')
fileread = filedic.read()
new_dic = aeval(fileread)

def no_af(tree):
    for badspan in tree.xpath("//span[contains(@id, 'af')]"):
        badspan.getparent().remove(badspan)
    return tree

def no_normal():
    no_af(tree)

    for span in tree.xpath('.//span'):
        span_id = span.xpath('@id')

        for x in span_id:
            if x in new_dic:
                get_style = x
                parent = span.getparent()
                par_span = parent.xpath('@class')
                if par_span:
                    for ID in par_span:
                        if ID in new_dic:
                            get_par_style = ID
                            if 'font-weight' in new_dic[get_par_style] and 'font-style' in new_dic[get_par_style]:
                                if 'font-weight' in new_dic[get_style] and 'font-style' in new_dic[get_style]:
                                    if new_dic[get_par_style]['font-weight']==new_dic[get_style]['font-weight'] and new_dic[get_par_style]['font-style']==new_dic[get_style]['font-style']:
                                        etree.strip_tags(parent, 'span')

    print etree.tostring(tree, pretty_print =True, method = "html", encoding = "utf-8")

This results in:

AttributeError: 'NoneType' object has no attribute 'xpath'

And I know that it is exactly the line "etree.strip_tags(parent, 'span')" that causes the error, because when I comment it out and print something after any other line, everything works.

Also, I am not sure whether etree.strip_tags(parent, 'span') will do what I need. What if the parent contains several spans with different formatting? Will this command strip all of them anyway? I actually need to strip only one span, the current one, which is taken at the beginning of the loop in "for span in tree.xpath('.//span'):".

I have been looking at this bug for the whole day, I think I am overlooking something... I desperately need your help!

Upvotes: 1

Views: 710

Answers (1)

Jonathan Eunice

Reputation: 22463

lxml is great, but it provides a pretty low-level "etree" data structure, and does not have the most extensive set of editing operations built-in. What you need is an "unwrap" operation that you can apply to individual elements to keep their text, any child elements, and their "tail" in the tree, but not the element itself. Here is such an operation (plus a needed helper function):

def noneCat(*args):
    """
    Concatenate arguments. Treats None as the empty string, though it returns
    the None object if all the args are None. That might not seem sensible, but
    it works well for managing lxml text components.
    """
    for ritem in args:
        if ritem is not None:
            break
    else:
        # Executed only if loop terminates through normal exhaustion, not via break
        return None

    # Otherwise, grab their string representations (empty string for None)
    return ''.join((unicode(v) if v is not None else "") for v in args)


def unwrap(e):
    """
    Unwrap the element. The element is deleted and all of its children
    are pasted in its place.
    """
    parent = e.getparent()
    prev = e.getprevious()

    kids = list(e)
    siblings = list(parent)

    # parent inherits children, if any
    sibnum = siblings.index(e)
    if kids:
        parent[sibnum:sibnum+1] = kids
    else:
        parent.remove(e)

    # prev node or parent inherits text
    if prev is not None:
        prev.tail = noneCat(prev.tail, e.text)
    else:
        parent.text = noneCat(parent.text, e.text)

    # last child, prev node, or parent inherits tail
    if kids:
        last_child = kids[-1]
        last_child.tail = noneCat(last_child.tail, e.tail)
    elif prev is not None:
        prev.tail = noneCat(prev.tail, e.tail)
    else:
        parent.text = noneCat(parent.text, e.tail)
    return e
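
A side note for documents parsed with lxml.html rather than plain lxml.etree: elements there also carry a built-in drop_tag() method which, as far as I can tell, does the same kind of unwrapping (the tag goes, its text, children and tail are merged into the parent). A minimal sketch, written for Python 3 unlike the rest of the code here, using one paragraph from the question; the variable names are only for illustration:

```python
import lxml.html

# One paragraph from the question, parsed as an HTML fragment.
fragment = lxml.html.fragment_fromstring(
    '<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr '
    '<span id="yyyyy"> jj </span> </p>'
)

# Pick the single span we want to unwrap and drop only its tag.
span = fragment.xpath('.//span[@id="xxxxx"]')[0]
span.drop_tag()

result = lxml.html.tostring(fragment, encoding="unicode")
print(result)
```

drop_tag() is not available on elements created by the plain etree parser, so this only helps if the file is loaded through lxml.html.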

You've already done part of the work: decomposing the CSS and deciding whether one selector (span#id) is redundant with respect to another (p.class). Let's extend that and wrap it into a function:

cssdict = { 'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique',
            'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
            'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
            'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
            'Title': 'font-style: oblique; text-align: center; font-weight: bold',
            'norm': 'font-style: normal; text-align: center; font-weight: normal'
          }

RELEVANT = ['font-weight', 'font-style']

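def parse_css_spec(s):
    """
    Decompose CSS style spec into a dictionary of its components.
    """
    # Skip empty parts (e.g. after a trailing ';') and split on the first
    # ':' only, so values that themselves contain ':' stay intact
    parts = [ p.strip() for p in s.split(';') if p.strip() ]
    attpairs = [ p.split(':', 1) for p in parts ]
    attpairs = [ (k.strip(), v.strip()) for k,v in attpairs ]
    return dict(attpairs)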

cssparts = { k: parse_css_spec(v) for k,v in cssdict.items() }
# pprint(cssparts)

def redundant_span(span_css_name, parent_css_name, consider=RELEVANT):
    """
    Determine if a given span is redundant with respect to its parent,
    considering specific attribute names. If the span's attribute
    values are the same as the parent's, consider it redundant.
    """
    span_spec = cssparts[span_css_name]
    parent_spec = cssparts[parent_css_name]
    for k in consider:
        # Any differences => not redundant
        if span_spec[k] != parent_spec[k]:
            return False
    # Everything matches => is redundant
    return True
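
One caveat: the lookups above raise KeyError if a span id, a parent class, or one of the compared properties is missing from the dictionary. Here is a hedged sketch (Python 3) of a more forgiving variant that simply treats anything unknown as "not redundant". The names parse_spec, PARSED and is_redundant are made up for this illustration, and whether "unknown means keep the span" is the right policy depends on your data:

```python
CSS = {
    "xxxxx": "font-weight: bold; font-size: 8.0pt; font-style: oblique",
    "yyyyy": "font-weight: normal; font-size: 9.0pt; font-style: italic",
    "aaaa": "font-weight: bold; font-size: 9.0pt; font-style: italic",
    "bbbbbb": "font-weight: normal; font-size: 9.0pt; font-style: normal",
    "Title": "font-style: oblique; text-align: center; font-weight: bold",
    "norm": "font-style: normal; text-align: center; font-weight: normal",
}
CONSIDER = ("font-weight", "font-style")


def parse_spec(spec):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}, skipping malformed parts."""
    pairs = (part.split(":", 1) for part in spec.split(";") if ":" in part)
    return {key.strip(): value.strip() for key, value in pairs}


PARSED = {name: parse_spec(spec) for name, spec in CSS.items()}


def is_redundant(span_id, parent_class, consider=CONSIDER):
    """True only if both names are known and every considered property matches."""
    span_spec = PARSED.get(span_id)
    parent_spec = PARSED.get(parent_class)
    if span_spec is None or parent_spec is None:
        return False  # assumption: unknown names are never redundant
    return all(
        prop in span_spec and span_spec[prop] == parent_spec.get(prop)
        for prop in consider
    )


print(is_redundant("xxxxx", "Title"))    # True
print(is_redundant("aaaa", "norm"))      # False
print(is_redundant("no-such-id", None))  # False
```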

Ok, so preparation done, time for the main show:

import lxml.html
from lxml.html import tostring

source = """
<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>
"""

h = lxml.html.document_fromstring(source)

print "<!-- before -->"
print tostring(h, pretty_print=True)
print

for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        unwrap(span)

print "<!-- after -->"
print tostring(h, pretty_print=True)

Yielding:

<!-- before -->
<html><body>
<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss <span id="bbbbbb"> aa </span> </p>
</body></html>


<!-- after -->
<html><body>
<p class="Title">blablabla bla prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss  aa  </p>
</body></html>

UPDATE

On second thought, you don't need unwrap. I'm using it because it's conveniently in my toolbox. You can do without it by using a mark-sweep approach along with etree.strip_tags, like this:

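from lxml import etree

for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        # mark: rename the element instead of editing the tree mid-loop
        span.tag = "JUNK"
# sweep: strip every marked element in one pass, keeping its text and tail
etree.strip_tags(h, "JUNK")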
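
This also seems to explain the AttributeError in the question. etree.strip_tags(parent, 'span') strips every span under that parent, not only the current one, so a span that the loop collected earlier and reaches later is already detached and its getparent() returns None. A self-contained sketch (Python 3) showing first the failure mode and then the mark-and-sweep fix; the STYLES table is a hand-simplified stand-in for the parsed CSS dictionary, assumed only for this example:

```python
from lxml import etree
import lxml.html

SOURCE = """
<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss <span id="bbbbbb"> aa </span> </p>
"""

# Hypothetical simplification: (font-weight, font-style) per id or class.
STYLES = {
    "xxxxx": ("bold", "oblique"), "yyyyy": ("normal", "italic"),
    "aaaa": ("bold", "italic"), "bbbbbb": ("normal", "normal"),
    "Title": ("bold", "oblique"), "norm": ("normal", "normal"),
}

# 1. The failure mode: stripping by tag name on the parent removes ALL of
#    its spans, so the next span from the earlier xpath result is orphaned.
doc = lxml.html.document_fromstring(SOURCE)
spans = doc.xpath("//span[@id]")
etree.strip_tags(spans[0].getparent(), "span")
orphan_parent = spans[1].getparent()
print(orphan_parent)  # None, so orphan_parent.xpath(...) would fail

# 2. Mark and sweep: rename only the redundant spans, then strip that
#    temporary tag name once the loop is finished.
doc = lxml.html.document_fromstring(SOURCE)
for span in doc.xpath("//span[@id]"):
    span_style = STYLES.get(span.get("id"))
    parent_style = STYLES.get(span.getparent().get("class"))
    if span_style is not None and span_style == parent_style:
        span.tag = "junk"
etree.strip_tags(doc, "junk")

cleaned = lxml.html.tostring(doc, encoding="unicode")
print(cleaned)
```

Because the tree is only modified after the loop, every span still has its parent while it is being examined.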

Upvotes: 2