Reputation: 396

How to parse html inside CDATA using Python?

I'm getting a XML object from a website that looks like this:

<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>

I need to parse the table rows inside CDATA. I tried to use it as input to lxml.html.fromstring() but the output provided ignores CDATA content. Any way to get everything inside the CDATA using lxml or other Python lib?

Upvotes: 0

Answers (1)

mr_mooo_cow

Reputation: 1128

Use BeautifulSoup. CData is a subclass of a NavigableString.

import bs4

data = """<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>"""

soup = bs4.BeautifulSoup(data, 'html.parser')

for cd in soup.findAll(text=True):
    if isinstance(cd, bs4.CData):
        print('CData contents: %r' % cd)

reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings

Upvotes: 1

How to parse html inside CDATA using Python?

Answers (1)

Related Questions