Reputation: 5184
I am trying to get the historical economic calendar data from this website, https://www.investing.com/economic-calendar/, for the following date range: 1 Feb 2020 to 5 Feb 2020.
Today is 4 Feb 2020.
If I post to the https://www.investing.com/economic-calendar/ URL as shown below, I can extract the table using BeautifulSoup, but I cannot select any day except the current one. The table I end up with in my Python script is for today (4 Feb 2020).
import requests
import pandas as pd
from bs4 import BeautifulSoup
payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
"dateFrom":"2020-02-01",
"dateTo":"2020-02-05",
"timeZone":"8",
"timeFilter":"timeRemain",
"currentTab":"custom",
"limit_from":"0"}
urlheader = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
url = "https://www.investing.com/economic-calendar/"
req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")
I can see that the page sends a POST request to "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData" whenever I change the date range or filter settings.
So I post to that URL instead, since I want to select the dates:
import requests
import pandas as pd
from bs4 import BeautifulSoup
payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
"dateFrom":"2020-02-01",
"dateTo":"2020-02-05",
"timeZone":"8",
"timeFilter":"timeRemain",
"currentTab":"custom",
"limit_from":"0"}
urlheader = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"
req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")
But this time the response contains no element with the id economicCalendarData, so the table variable comes out empty (None). The soup variable has data in it, but not the table.
To summarise: if I post to https://www.investing.com/economic-calendar/, I get the table for the current day only (4 Feb 2020), no matter what dates I put in the payload (dateFrom, dateTo).
If I post to https://www.investing.com/economic-calendar/Service/getCalendarFilteredData instead, the table comes up empty. The soup variable does contain data, but not the table I requested. What am I doing wrong? How do I save the tables for the dates I select?
Upvotes: 0
Views: 1749
Reputation: 22440
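You were real close. The getCalendarFilteredData endpoint returns JSON rather than a full page, and the table rows are an HTML fragment under its 'data' key, so there is no economicCalendarData table to find. Parse that fragment instead. If I understood your requirements, the following should get you there: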
import requests
from bs4 import BeautifulSoup
url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"
payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
"dateFrom":"2020-02-01",
"dateTo":"2020-02-05",
"timeZone":"8",
"timeFilter":"timeRemain",
"currentTab":"custom",
"limit_from":"0"}
req = requests.post(url, data=payload, headers={
    "User-Agent":"Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest"
})
# The response body is JSON; the rows are an HTML fragment under the 'data' key
soup = BeautifulSoup(req.json()['data'],"lxml")
for items in soup.select("tr"):
    # Collect the text of every header/data cell in this row
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)
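To actually save the table rather than print it, one option is to collect each data list and write the rows out as CSV. Below is a minimal sketch using only the standard library. The helper name rows_to_csv and the sample rows are hypothetical, made up for illustration, and are not real calendar output. It assumes rows may differ in length, so it pads them to the widest row.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of row lists (like the `data` lists above) to CSV text."""
    # Drop rows with no cells, e.g. a <tr> that had no <th>/<td> children.
    rows = [row for row in rows if row]
    # Rows can differ in length, so pad each one to the widest row.
    width = max((len(row) for row in rows), default=0)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [""] * (width - len(row)))
    return buffer.getvalue()


# Hypothetical sample rows (placeholders, not real calendar data).
sample_rows = [
    ["Monday, February 3, 2020"],
    [],
    ["09:45", "XXX", "Example event", "1.0", "1.1", "1.2"],
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

In the loop above you would append each data list to a list instead of printing it, pass that list to the helper, and write the returned text to a file opened with newline="" and an explicit encoding.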
Upvotes: 2