Reputation: 1824
I have tweets saved in an XML file as:
<tweet>
<tweetid>142389495503925248</tweetid>
<user>ccifuentes</user>
<content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
<date>2011-12-02T00:47:55</date>
<lang>es</lang>
<sentiments>
<polarity><value>NONE</value><type>AGREEMENT</type></polarity>
</sentiments>
<topics>
<topic>otros</topic>
</topics>
</tweet>
To parse these, I created a BeautifulSoup instance via
soup = BeautifulSoup(xml, "lxml")
where xml is the raw XML file. To access a single tweet I did this:
tweets = soup.find_all('tweet')
for tw in tweets:
    print(tw)
    break
This results in
<tweet>
<tweetid>142389495503925248</tweetid>
<user>ccifuentes</user>
<content></content>
<date>2011-12-02T00:47:55</date>
<lang>es</lang>
<sentiments>
<polarity><value>NONE</value><type>AGREEMENT</type></polarity>
</sentiments>
<topics>
<topic>otros</topic>
</topics>
</tweet>
Note that the CDATA section was dropped when I printed the first tweet. I need that content; how can I get it?
Upvotes: 3
Views: 1586
Reputation: 12158
Change the parser to xml, which reads the CDATA content as ordinary text:
soup = bs4.BeautifulSoup(xml, 'xml')
Output:
<content>Salgo de #VeoTV , que día más largoooooo...</content>
Or use html.parser:
soup = bs4.BeautifulSoup(xml, 'html.parser')
Output:
<content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
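To pull the tweet text out as a plain string once the parser is switched, a minimal sketch along these lines may help. It assumes bs4 and lxml are installed (the 'xml' parser relies on lxml); the sample string is a shortened copy of the tweet from the question, and the helper name tweet_contents is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the tweet from the question, used as stand-in input.
SAMPLE_XML = """<tweet>
<tweetid>142389495503925248</tweetid>
<user>ccifuentes</user>
<content><![CDATA[Salgo de #VeoTV , que día más largoooooo...]]></content>
<lang>es</lang>
</tweet>"""


def tweet_contents(xml):
    """Return the text of every <content> element (hypothetical helper).

    Assumes each <tweet> has a <content> child, as in the question's data.
    """
    # The 'xml' parser treats the CDATA section as the element's text.
    soup = BeautifulSoup(xml, "xml")
    return [tw.find("content").get_text() for tw in soup.find_all("tweet")]


if __name__ == "__main__":
    for text in tweet_contents(SAMPLE_XML):
        print(text)
```

With the 'xml' parser the CDATA wrapper itself is not kept, only the text inside it, which is usually what is wanted for further processing.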
Upvotes: 5