Reputation: 1961
I have data in XML format. Example is shown as follow. I want to extract data from <text> tag
.
Here is my XML data.
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
I need only The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex.
Is it possible? thanks
Upvotes: 2
Views: 198
Reputation: 7336
Whenever you find yourself looking at XML data and thinking about regular expressions, you should stop and ask yourself why you aren't considering a real XML parser. The structure of XML makes it perfectly suited for a proper parser, and maddeningly frustrating for regular expressions.
If you must use regular expressions, the following should do. Until your document changes!
import re
p = re.compile("<text>(.*)<h1>")
p.search(xml_text).group(1)
Spoiler: regular expressions might be appropriate if this is just a one-off problem that needs a quick and dirty solution. Or they might be appropriate if you know the input data will be fairly static and can't tolerate the overhead of a parser.
Upvotes: 2
Reputation: 33716
Here is an example using xml.etree.ElementTree:
>>> import xml.etree.ElementTree as ET
>>> data = """<text>
... The 40-Year-Old Virgin is a 2005 American buddy comedy
... film about a middle-aged man's journey to finally have sex.
...
... <h1>The plot</h1>
... Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
... <h1>Cast</h1>
...
... <h1>Soundtrack</h1>
...
... <h1>External Links</h1>
... </text>"""
>>> xml = ET.XML(data)
>>> xml.text
"\n The 40-Year-Old Virgin is a 2005 American buddy comedy\n film about a middle-aged man's journey to finally have sex.\n\n "
>>> xml.text.strip().replace('\n ', '')
"The 40-Year-Old Virgin is a 2005 American buddy comedy film about a middle-aged man's journey to finally have sex."
And there you go!
Upvotes: 1
Reputation: 500327
Here is how you could do this using ElementTree
:
In [18]: import xml.etree.ElementTree as et
In [19]: t = et.parse('f.xml')
In [20]: print t.getroot().text.strip()
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
Upvotes: 2
Reputation: 879451
Use an XML parser to parse XML. Using lxml:
import lxml.etree as ET
content='''\
<text>
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
<h1>The plot</h1>
Andy Stitzer (Steve Carell) is the eponymous 40-year-old virgin.
<h1>Cast</h1>
<h1>Soundtrack</h1>
<h1>External Links</h1>
</text>
'''
text=ET.fromstring(content)
print(text.text)
yields
The 40-Year-Old Virgin is a 2005 American buddy comedy
film about a middle-aged man's journey to finally have sex.
Upvotes: 4
Reputation: 5198
Don't use regular expression to parse XML/HTML. Use a proper parser like BeautifulSoup or lxml in python.
Upvotes: 2