Reputation: 4561
Consider this:
input = """Yesterday<person>Peter</person>drove to<location>New York</location>"""
How can one use regex patterns to extract:
person: Peter
location: New York
This works well, but I don't want to hard-code the tags, since they can change:
print(re.findall("<person>(.*?)</person>", input))
print(re.findall("<location>(.*?)</location>", input))
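For flat, well-formed input like this, one possible way to avoid naming the tags is to capture the tag name and require the closing tag to repeat it with a backreference. This is only a sketch: the variable names `text`, `pattern`, and `pairs` are illustrative, and it assumes tags without attributes or nesting.

```python
import re

text = "Yesterday<person>Peter</person>drove to<location>New York</location>"

# (\w+) captures the tag name; \1 requires the closing tag to use the same name.
pattern = re.compile(r"<(\w+)>(.*?)</\1>")

# With two groups, findall returns a list of (tag, content) tuples.
pairs = pattern.findall(text)
for tag, value in pairs:
    print("%s: %s" % (tag, value))
```

Anything more irregular than this (attributes, nested elements, unclosed tags) is likely to break the pattern.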
Upvotes: 2
Views: 114
Reputation: 4940
Use a tool designed for the work. I happen to like lxml, but there are others.
>>> minput = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> from lxml import html
>>> tree = html.fromstring(minput)
>>> for e in tree.iter():
...     print(e, e.tag, e.text_content())
...     if e.tag == 'person':  # getting the last name per comment
...         last = e.text_content().split()[-1]
...         print(last)
...
<Element p at 0x3118ca8> p YesterdayPeter Smithdrove toNew York
<Element person at 0x3118b48> person Peter Smith
Smith # here is the last name
<Element location at 0x3118ba0> location New York
If you are new to Python, you might want to visit this site to get an installer for a number of packages, including lxml.
Upvotes: 6
Reputation: 473853
Avoid parsing HTML with regex; use an HTML parser instead.
Here's an example using BeautifulSoup:
from bs4 import BeautifulSoup
data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning
print('person: %s' % soup.person.text)
print('location: %s' % soup.location.text)
prints:
person: Peter
location: New York
Note the simplicity of the code.
Hope that helps.
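If the tag names are not known in advance, the same parser-based idea can be sketched without naming any tag. The example below swaps in the standard library's `html.parser` so that it runs without third-party packages. The class name `TagTextCollector` and its attributes are hypothetical, and it assumes simple, properly closed tags as in the question.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for every element, without naming tags up front."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # finished (tag, text) pairs, in the order tags close
        self._stack = []   # currently open elements as (tag, text_parts)

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, []))

    def handle_data(self, data):
        # Text outside any element (e.g. "Yesterday") is ignored.
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        # Only close the element if it matches the innermost open tag.
        if self._stack and self._stack[-1][0] == tag:
            name, parts = self._stack.pop()
            self.pairs.append((name, "".join(parts)))


data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
collector = TagTextCollector()
collector.feed(data)
collector.close()

for name, text in collector.pairs:
    print("%s: %s" % (name, text))
```

New tags in the input then show up in `collector.pairs` without any change to the code.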
Upvotes: 3