da Bich

Reputation: 536

LXML parser loses text in data event

I have created a custom parser target to handle a complicated XML conversion. It has worked well for quite a while. However, I have just noticed that when an incoming XML file has a tag and data such as:

<ParticipantName>STEVE O&apos;NEILL</ParticipantName>

The data event returns just NEILL to me, rather than STEVE O&apos;NEILL or even STEVE O'NEILL.

I've been reading up on lxml, and I have a feeling it has something to do with the encoding option, but I'm not sure from what I've read. It's hard to find answers about handling these HTML(?) characters, and I'm not even sure which encoding I would try.

Right now, the way I create the parser is (parser_target is my custom parser):

parser = etree.XMLParser(target=parser_target)
raw_records = etree.parse(full_path, parser)

And in the data event, I simply save the value and return it in the close event.

def data(self, data):
    """catch the data event on parsing the xml file.
    """
    self.my_variable = data

There is a lot more complexity in my code, so I'm just showing the basics here. Does anyone know how to cleanly deal with incoming XML data that contains these HTML (I think) characters? I have no control over the file I receive, so I need to deal with it as it comes in.

[EDIT] OK, in building the example to share, I think I've found the issue. It seems the data event is called multiple times when this text occurs. If anyone sees anything different, or another way of handling this through the 'encoding' parameter when creating the parser, let me know.

Here is the example I built:

main.py:

from lxml import etree

from ParserTarget import ParserTarget


def test():

    parser_target = ParserTarget()

    print('Parsing begins')
    parser = etree.XMLParser(target=parser_target)
    full_path = "/data/test.xml"
    raw_records = etree.parse(full_path, parser)
    print(raw_records)


if __name__ == '__main__':
    test()

ParserTarget.py:

"""parser_target

data class for handling XML parsing
"""
from dataclasses import dataclass


@dataclass
class ParserTarget:

    def __init__(self):
        """initialize the object variables
        """
        self.mydata = ""

    def start(self, tag, attrib):
        """catch the start event on parsing the xml file.

        """
        print("start function: " + tag + " : " + str(attrib))

    def data(self, data):
        """catch the data event on parsing the xml file.

        """
        print("data function: " + data)
        self.mydata = data

    def end(self, tag):
        """catch the end event on parsing the xml file.

        """
        print("end function: " + tag)

    def close(self):
        """catch the close event on parsing the xml file.


        """

        # done

        return self.mydata

Place the following XML file in /data, or change the path in the code as you wish:

test.xml:

<?xml version="1.0" encoding="UTF-8"?>
              <part>
                    <name>STEVE O&apos;NEILL</name>
                    <role tc="9">something - contingent</role>
                    <pct>10</pct>
                    <ind tc="0">False</ind>
                </part>

When I run the code, the output is:

Parsing begins
start function: part : <lxml.etree._ImmutableMapping object at 0x000001FBCDBB9700>
data function: 
                    
start function: name : <lxml.etree._ImmutableMapping object at 0x000001FBCDBB9700>
data function: STEVE O
data function: '
data function: NEILL
end function: name
data function: 
                    
start function: role : {'tc': '9'}
data function: something - contingent
end function: role
data function: 
                    
start function: pct : <lxml.etree._ImmutableMapping object at 0x000001FBCDBB9700>
data function: 10
end function: pct
data function: 
                    
start function: ind : {'tc': '0'}
data function: False
end function: ind
data function: 
                
end function: part

                

Process finished with exit code 0

Note that the data function is actually called three times for the single "name" tag, with the text split around the &apos; character. Because each call overwrites the previous value, I end up with only the last chunk.

I'll just append from now on, but please let me know if there is a better way to parse. Thanks.
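The splitting described above can be reproduced without a file. The following is a minimal sketch, not the original code: `RecordingTarget` is a hypothetical class name, and the import falls back to the standard library's `xml.etree.ElementTree` (which accepts the same kind of target object) so the snippet runs even where lxml is not installed. How many chunks arrive depends on the parser, so no specific output is shown.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same target interface
    import xml.etree.ElementTree as etree


class RecordingTarget:
    """Hypothetical target that records every data() call it receives."""

    def __init__(self):
        self.chunks = []  # every piece of text, in arrival order
        self.last = ""    # what plain assignment would leave behind

    def start(self, tag, attrib):
        pass

    def data(self, data):
        self.chunks.append(data)
        self.last = data  # overwrites, as in the code from the question

    def end(self, tag):
        pass

    def close(self):
        return self.chunks, self.last


parser = etree.XMLParser(target=RecordingTarget())
# With a target parser, fromstring() returns whatever close() returns.
chunks, last = etree.fromstring("<name>STEVE O&apos;NEILL</name>", parser)

print(chunks)           # one or more pieces, depending on the parser
print(last)             # only the final piece survives assignment
print("".join(chunks))  # joining the pieces restores the full text
```

Whatever the chunking turns out to be, joining the recorded pieces gives back the complete name, which is why appending works where assignment does not.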

Upvotes: 1

Views: 201

Answers (2)

da Bich

Reputation: 536

self.mydata = data

needs to be

self.mydata += data

so that all data events for the tag are concatenated. Of course, be sure to clear self.mydata at the 'end' event, or re-initialize it at the 'start' event.

I haven't heard of any other way to get the text in one piece using the encoding or other options in lxml.
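A minimal sketch of that approach, under some assumptions: `CollectingTarget` and its `values` dictionary are hypothetical names chosen for illustration, the sample XML is a shortened version of the test file, and the import falls back to the standard library's `xml.etree.ElementTree` (same target interface) if lxml is not installed. The buffer is reset at both 'start' and 'end' so whitespace between tags does not leak into the next element.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same target interface
    import xml.etree.ElementTree as etree


class CollectingTarget:
    """Hypothetical target that joins all text chunks of each element."""

    def __init__(self):
        self._chunks = []  # text pieces seen since the last start/end event
        self.values = {}   # tag name -> complete text of that element

    def start(self, tag, attrib):
        self._chunks = []  # new element: start with an empty buffer

    def data(self, data):
        self._chunks.append(data)  # accumulate instead of overwriting

    def end(self, tag):
        text = "".join(self._chunks).strip()
        if text:  # skip elements that contain only whitespace
            self.values[tag] = text
        self._chunks = []  # clear so trailing whitespace is not carried over

    def close(self):
        return self.values


XML = "<part><name>STEVE O&apos;NEILL</name><pct>10</pct></part>"
parser = etree.XMLParser(target=CollectingTarget())
# With a target parser, fromstring() returns whatever close() returns.
result = etree.fromstring(XML, parser)
print(result)
```

Collecting the pieces in a list and joining them once at 'end' is equivalent to `+=` on a string; it just avoids repeated string copies when an element has many chunks.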

Upvotes: 0

Jim Garrison

Reputation: 86774

def data(self, data):
    """catch the data event on parsing the xml file.

    """
    print("data function: " + data)
    self.mydata = data

Your debug output is clear. The text appears as three separate text nodes, and you keep only the last one.

You have to keep track of the start of the tag and concatenate any text nodes found inside.

Upvotes: 1
