Deina Underhill

Reputation: 567

Extracting HTML data fields with Python

Please forgive my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Keep in mind that, more often than not, some or all of the fields will be missing, in which case the corresponding value should stay NULL.

<div class="profile-section" id="a-bit-more-about">
                            <dl>
            <dt>Name:</dt>
            <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
        </dl>
        <!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
                        <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <div class="sep"></div>
    <dl>
        <dt>Hometown:</dt>
        <dd>Quiet Rest Maximum Security Twilight Home</dd>
    </dl>
    <dl>
        <dt>Currently:</dt>
        <dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
    </dl>
    <div class="sep"></div>

Upvotes: 0

Views: 865

Answers (2)

zhangyangyu

Reputation: 8610

Use a third-party module such as Beautiful Soup or lxml, or the built-in module html.parser. For example:

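from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup('<html><body><a>bbb</a></body></html>', 'html.parser')
soup.find('a')  # returns the first <a> tag: <a>bbb</a>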

Or, if you like, you can use a regex for a small, well-defined target.
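For the built-in html.parser route, here is a minimal sketch of how the dt/dd pairs from the question could be collected without any third-party dependency. The names ProfileParser and parse_profile and the trimmed SAMPLE_HTML are made up for illustration, and the empty Hometown entry is an assumption about what a missing value might look like. An empty field is stored as None.

```python
from html.parser import HTMLParser

# Trimmed copy of the question's markup; the empty <dd> is a hypothetical missing value.
SAMPLE_HTML = """
<div class="profile-section" id="a-bit-more-about">
    <dl>
        <dt>Name:</dt>
        <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
    </dl>
    <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <dl>
        <dt>Hometown:</dt>
        <dd></dd>
    </dl>
</div>
"""


class ProfileParser(HTMLParser):
    """Collects each <dt> label and the text of the <dd> that follows it."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current_tag = None  # "dt" or "dd" while inside one, else None
        self._label = None        # most recent <dt> text, without the colon
        self._chunks = []         # text pieces seen inside the current tag

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current_tag = tag
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested <span> tags also arrives here.
        if self._current_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != self._current_tag:
            return
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if tag == "dt":
            self._label = text.rstrip(":")
        elif self._label is not None:
            # Empty values become None so missing data stays NULL.
            self.fields[self._label] = text or None
            self._label = None
        self._current_tag = None


def parse_profile(html):
    parser = ProfileParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(parse_profile(SAMPLE_HTML))
```

This keeps one dictionary entry per label, so a field that is absent from the page is simply absent from the result, and a lookup with fields.get("Hometown") yields None either way.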

Upvotes: 2

WeaselFox

Reputation: 7380

You want an HTML parser. I recommend Beautiful Soup or lxml.
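As a rough illustration of the Beautiful Soup approach applied to the markup in the question, the sketch below looks up each field by class name or by its dt label and returns None when a field is absent. The helper names (text_or_none, labelled_value, extract_profile) and the output keys are hypothetical, and SAMPLE_HTML is a trimmed copy of the question's snippet with Hometown left out on purpose.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; Hometown is deliberately missing.
SAMPLE_HTML = """
<div class="profile-section" id="a-bit-more-about">
    <dl>
        <dt>Name:</dt>
        <dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
    </dl>
    <dl>
        <dt>Joined:</dt>
        <dd>September 1910</dd>
    </dl>
    <dl>
        <dt>Currently:</dt>
        <dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
    </dl>
</div>
"""


def text_or_none(node):
    """Return the stripped text of a tag, or None if the tag is missing or empty."""
    if node is None:
        return None
    text = node.get_text(strip=True)
    return text or None


def labelled_value(section, label):
    """Return the text of the <dd> that follows the <dt> with the given label."""
    dt = section.find("dt", string=label)
    if dt is None:
        return None
    return text_or_none(dt.find_next_sibling("dd"))


def extract_profile(html):
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("div", id="a-bit-more-about")
    if section is None:
        # Fall back to the whole document if the wrapper div is absent.
        section = soup
    return {
        "given_name": text_or_none(section.find("span", class_="given-name")),
        "family_name": text_or_none(section.find("span", class_="family-name")),
        "joined": labelled_value(section, "Joined:"),
        "hometown": labelled_value(section, "Hometown:"),
        "locality": text_or_none(section.find("span", class_="locality")),
        "country": text_or_none(section.find("span", class_="country-name")),
    }


if __name__ == "__main__":
    print(extract_profile(SAMPLE_HTML))
```

Because find returns None when nothing matches, routing every lookup through text_or_none means missing fields fall out as None without any special-casing per field.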

Upvotes: 3
