DevEx

Reputation: 4561

Regex pattern to extract tag and its contents

Consider this input:

input = """Yesterday<person>Peter</person>drove to<location>New York</location>"""

How can one use a regex pattern to extract the following?

person: Peter
location: New York

This works well, but I don't want to hardcode the tags, since they can change:

import re
print(re.findall("<person>(.*?)</person>", input))
print(re.findall("<location>(.*?)</location>", input))

Upvotes: 2

Views: 114

Answers (2)

PyNEwbie

Reputation: 4940

Use a tool designed for the work. I happen to like lxml, but there are others.

>>> minput = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> from lxml import html
>>> tree = html.fromstring(minput)
>>> for e in tree.iter():
        print(e, e.tag, e.text_content())
        if e.tag == 'person':            # tag is an attribute, not a method; getting the last name per comment
            last = e.text_content().split()[-1]
            print(last)


<Element p at 0x3118ca8> p YesterdayPeter Smithdrove toNew York
<Element person at 0x3118b48> person Peter Smith
Smith                                            # here is the last name
<Element location at 0x3118ba0> location New York

If you are new to Python, you might want to visit this site to get an installer for a number of packages, including lxml.
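If installing a package is not an option, the tags in the question's regex do not have to be hardcoded. Here is a minimal sketch in Python 3 that captures the tag name and uses a backreference for the closing tag. It assumes the tags are flat (never nested), carry no attributes, and are well-formed; the names `pattern` and `extract_tags` are only illustrative.

```python
import re

text = "Yesterday<person>Peter</person>drove to<location>New York</location>"

# (\w+) captures the tag name; \1 requires the closing tag to repeat it.
# (.*?) is non-greedy, so each match stops at the first matching close tag.
pattern = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_tags(s):
    """Return (tag, content) pairs; assumes flat tags without attributes."""
    return pattern.findall(s)

for tag, content in extract_tags(text):
    print("%s: %s" % (tag, content))
```

For the sample input this prints `person: Peter` and `location: New York`. Once tags can nest or carry attributes, a parser is the safer choice.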

Upvotes: 6

alecxe

Reputation: 473853

Avoid parsing HTML with regex; use an HTML parser instead.

Here's an example using BeautifulSoup:

from bs4 import BeautifulSoup

data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a bs4 warning

print('person: %s' % soup.person.text)
print('location: %s' % soup.location.text)

prints:

person: Peter
location: New York

Note the simplicity of the code.
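The example above still names the tags, which is what the question wanted to avoid. A parser can collect every tag without knowing its name in advance. The following is a rough sketch in Python 3 that swaps BeautifulSoup for the standard library's `html.parser`, so nothing needs to be installed. `TagCollector` and `collect` are hypothetical names, and text outside any tag is assumed to be of no interest.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (tag, text) pairs for every element, whatever its name."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.pairs = []      # (tag, text) results in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text outside any tag ("Yesterday", "drove to") is ignored.
        if self.open_tags:
            self.pairs.append((self.open_tags[-1], data))

def collect(markup):
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.pairs

data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
for tag, text in collect(data):
    print("%s: %s" % (tag, text))
```

Note that `html.parser` lowercases tag names, so `<Person>` would be reported as `person`.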

Hope that helps.

Upvotes: 3
