Reputation: 81
When I want to parsing XML document in Python using BeautifulSoup library, I faced some problems. The XML document that I want to parse:
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>
As you can see above, tag is a little strange. In my opinion, that( tag) is not a stand XML form, right? How can I parse this terrible form?
Upvotes: 5
Views: 16406
Reputation: 879073
You could use BeautifulSoup to parse XML:
import bs4 as bs
content='''\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>'''
soup = bs.BeautifulSoup(content, 'xml')
title = soup.title
print(title.string)
# Title Sample
link = soup.link.nextSibling
print(link)
# http://banhada.kr/?cateCode=09&viewCode=S0941580
Under the hood, BeautifulSoup uses lxml for parsing XML. Although it's not needed here, you might want to use lxml directly, since it gives you more succinct ways to navigate through XML using XPath:
import lxml.etree as ET
content='''\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>'''
doc = ET.fromstring(content)
title = doc.find('title')
print(title.text)
# Title Sample
link = doc.find('link')
print(link.tail)
# http://banhada.kr/?cateCode=09&viewCode=S0941580
Upvotes: 7
Reputation: 82924
You don't need BeautifulStoneSoup or lxml. Python's included batteries do the job just fine, and there doesn't seem to be anything non-compliant about your XML.
>>> content='''\
... <item>
... <title><![CDATA[Title Sample]]></title>
... <link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
... <time_start>2011-10-10 09:00:00</time_start>
... <time_end>2011-10-17 09:00:00</time_end>
... <price_original>35000</price_original>
... <price_now>20000</price_now>
... </item>'''
>>> import xml.etree.cElementTree as et
>>> foo = et.XML(content)
>>> for e in foo:
... print e.tag, e.text, repr(e.tail)
...
title Title Sample '\n'
link None 'http://banhada.kr/?cateCode=09&viewCode=S0941580\n'
time_start 2011-10-10 09:00:00 '\n'
time_end 2011-10-17 09:00:00 '\n'
price_original 35000 '\n'
price_now 20000 '\n'
>>>
Upvotes: 8