Reputation: 13
I'm trying to scrape news data from the Forex Factory calendar, but I have a small problem: the XML file has CDATA sections, and their values come out empty.
import requests
from bs4 import BeautifulSoup

def get_news_calendar():
    r = requests.get('http://www.forexfactory.com/ffcal_week_this.xml')
    soup = BeautifulSoup(r.text , 'lxml')
    events = soup.find_all('event')
    for event in events:
        print event.find('title').text, event.find('country').text, event.find('date'), event.find('time').text, event.find('impact').text, event.find('forecast').text, event.find('previous').text
Output:
Current Account EUR <date></date>
Retail Sales m/m GBP <date></date>
MPC Member Saunders Speaks GBP <date></date>
Core CPI m/m CAD <date></date>
CPI m/m CAD <date></date>
Trimmed CPI y/y CAD <date></date>
Median CPI y/y CAD <date></date>
Common CPI y/y CAD <date></date>
FOMC Member Kashkari Speaks USD <date></date>
Flash Manufacturing PMI USD <date></date>
Flash Services PMI USD <date></date>
Existing Home Sales USD <date></date>
IMF Meetings ALL <date></date>
IMF Meetings ALL <date></date>
Treasury Sec Mnuchin Speaks USD <date></date>
French Presidential Election EUR <date></date>
Example XML file:
<event>
    <title>German Flash Manufacturing PMI</title>
    <country>EUR</country>
    <date><![CDATA[04-21-2017]]></date>
    <time><![CDATA[7:30am]]></time>
    <impact><![CDATA[Medium]]></impact>
    <forecast><![CDATA[58.1]]></forecast>
    <previous><![CDATA[58.3]]></previous>
</event>
How can I print the values of the CDATA sections?
Upvotes: 1
Views: 2636
Reputation: 107567
Consider using lxml directly and running an XPath query for all <event> nodes; each child element's .text attribute returns the CDATA content.
import requests
import lxml.etree as et

def get_news_calendar():
    r = requests.get('http://www.forexfactory.com/ffcal_week_this.xml')
    data = et.fromstring(r.text.encode("utf-8"))
    events = data.xpath('//event')
    for event in events:
        print(event.find('title').text, event.find('country').text,
              event.find('date').text, event.find('time').text,
              event.find('impact').text, event.find('forecast').text,
              event.find('previous').text)

get_news_calendar()
# Bank Holiday NZD 04-16-2017 9:00pm Holiday None None
# Bank Holiday AUD 04-16-2017 10:00pm Holiday None None
# GDP q/y CNY 04-17-2017 2:00am High 6.8% 6.8%
# Industrial Production y/y CNY 04-17-2017 2:00am High 6.2% 6.3%
# Fixed Asset Investment ytd/y CNY 04-17-2017 2:00am Medium 8.8% 8.9%
# NBS Press Conference CNY 04-17-2017 2:00am Medium None None
# Retail Sales y/y CNY 04-17-2017 2:00am Low 9.7% 9.5%
# Bank Holiday CHF 04-17-2017 6:00am Holiday None None
# BOJ Gov Kuroda Speaks JPY 04-17-2017 6:15am High None None
# Bank Holiday GBP 04-17-2017 7:00am Holiday None None
# French Bank Holiday EUR 04-17-2017 7:00am Holiday None None
# ...
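The None values in the output above come from elements with no text, where .text is None. As a minimal sketch of one way to print empty strings instead, the snippet below parses inline sample data with the standard library's xml.etree.ElementTree rather than lxml (it offers the same find/findtext interface), so it runs without a network call. The names SAMPLE, FIELDS, event_fields and parse_events, the <weeklyevents> wrapper, and the self-closing <forecast /> elements are assumptions made for illustration, not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the <event> element from the question.
# The <weeklyevents> root and the empty <forecast />/<previous /> are assumptions.
SAMPLE = """<weeklyevents>
<event>
<title>German Flash Manufacturing PMI</title>
<country>EUR</country>
<date><![CDATA[04-21-2017]]></date>
<time><![CDATA[7:30am]]></time>
<impact><![CDATA[Medium]]></impact>
<forecast><![CDATA[58.1]]></forecast>
<previous><![CDATA[58.3]]></previous>
</event>
<event>
<title>Bank Holiday</title>
<country>NZD</country>
<date><![CDATA[04-16-2017]]></date>
<time><![CDATA[9:00pm]]></time>
<impact><![CDATA[Holiday]]></impact>
<forecast />
<previous />
</event>
</weeklyevents>"""

FIELDS = ("title", "country", "date", "time", "impact", "forecast", "previous")


def event_fields(event):
    """Return the event's fields as strings, using '' for empty or missing elements."""
    # findtext returns None when the child is missing; 'or ""' covers that case.
    return [(event.findtext(name) or "").strip() for name in FIELDS]


def parse_events(xml_text):
    """Parse the XML text and return one list of field values per <event>."""
    root = ET.fromstring(xml_text)
    return [event_fields(event) for event in root.iter("event")]


if __name__ == "__main__":
    for row in parse_events(SAMPLE):
        print(" ".join(row))
```

The same event_fields helper should work unchanged on lxml elements, since lxml.etree elements also provide findtext.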
Upvotes: 0
Reputation: 64949
You appear to have chosen the wrong parser. You are parsing an XML document, so you need to use lxml-xml instead of lxml, which parses the input as HTML.
Try replacing
soup = BeautifulSoup(r.text , 'lxml')
with
soup = BeautifulSoup(r.text , 'lxml-xml')
After making this change to your get_news_calendar function, I get the following output when running it on your example XML file:
German Flash Manufacturing PMI EUR <date>04-21-2017</date> 7:30am Medium 58.1 58.3
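The <date> tags still appear in that output because the question's print statement passes event.find('date') itself rather than its .text. As a small sketch, assuming bs4 and lxml are installed, the snippet below reads the date text from the question's sample event with either parser name. SAMPLE_EVENT and date_text are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# The sample <event> from the question, inlined so no network call is needed.
SAMPLE_EVENT = """<event>
<title>German Flash Manufacturing PMI</title>
<country>EUR</country>
<date><![CDATA[04-21-2017]]></date>
<time><![CDATA[7:30am]]></time>
<impact><![CDATA[Medium]]></impact>
<forecast><![CDATA[58.1]]></forecast>
<previous><![CDATA[58.3]]></previous>
</event>"""


def date_text(xml_text, parser):
    """Return the text of the first <date> element using the given BeautifulSoup parser."""
    soup = BeautifulSoup(xml_text, parser)
    return soup.find("date").get_text()


if __name__ == "__main__":
    # 'lxml-xml' parses the input as XML, so the CDATA content is kept.
    print(repr(date_text(SAMPLE_EVENT, "lxml-xml")))
    # 'lxml' parses the input as HTML; compare its result with the line above.
    print(repr(date_text(SAMPLE_EVENT, "lxml")))
```

Using .text (or get_text()) on every field, including date, gives plain values without the surrounding tags.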
Upvotes: 2