Reputation: 191
Hi all, I am new to Python. Please help me with this requirement.
http://www.example.com/ratings/ratings-rationales.jsp?date=true&result=true
On this page, I first have to choose a date, and then the rating company lists its publications as links. I want to find the links whose text contains a word of interest, say "stable". I have tried the following using Python 3.4.2:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "http://www.example.com/ratings/ratings-rationales.jsp?date=true&result=true"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
example_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(example_links)
result_links = [urljoin(url, tag['href']) for tag in results]
print(result_links)
This does not print any links. I am seeing the following result:
>>>
[]
Obviously, I am not giving the date as input.
1. How do I input the from and to dates as today's date? (The goal is to check periodically for new links containing a word of interest, but that will be a question for later.)
For example after giving from date: 31-12-2014 to date: 31-12-2014 as inputs
is the output I need as hyperlink.
Any suggestions would be very useful. Thanks in advance.
Here is the updated code. I am still not able to get the result; the output is still an empty list ([]).
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
#Getting the current date
today = datetime.today()
#For the sake of brevity some parameters are missing on the payload
payload = {
'selArchive': 1,
'selDay': 31,
'selMonth': 12,
'selYear': 2014,
'selDay1': 31,
'selMonth1': 12,
'selYear1': 2014,
'selSector': '',
'selIndustry': '',
'selCompany': ''
}
example_url = "http://www.example.com/"
r = requests.post(example_url, data=payload)
rg = requests.get(example_url)
soup = BeautifulSoup(rg.content, "html.parser")
example_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(example_links)
result_links = [urljoin(example_url, tag['href']) for tag in results]
print(result_links)
Upvotes: 2
Views: 11858
Reputation: 3043
You should be doing a POST instead of a GET for this particular site, sending the search form fields as the POST parameters.
Check this example:
from datetime import datetime
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
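# Get the current date
today = datetime.today()
# The from and to dates below are hardcoded to 31-12-2014; use today.day, today.month and today.year to query the current date.
# Apart from the dates, the main filter set here is the sector (selSector).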
payload = {
'selDay': 31,
'selMonth': 12,
'selYear': 2014,
'selDay1': 31,
'selMonth1': 12,
'selYear1': 2014,
'selIndustry': '',
'txtPhrase': '',
'txtInclude': '',
'txtExclude': '',
'selSubServices': 'ALL',
'selServices': 'all',
'maxresults': 10,
'pageno': 1,
'srchInSrchCol': '01',
'sortOptions': 'date',
'isSrchInSrch': '01',
'txtShowQuery': '01',
'tSearch': 'Find a Rating',
'txtSearch': '',
'selArchive': 1,
'selSector': 148,
'selCompany': '',
'x': 40,
'y': 11,
}
crisil_url = "http://www.crisil.com/ratings/ratings-rationales.jsp?result=true&Sector=true"
r = requests.post(crisil_url, data=payload)
soup = BeautifulSoup(r.content, "html.parser")
crisil_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'stable' in tag.get_text().lower() and 'href' in tag.attrs
results = soup.find_all(crisil_links)
result_links = [urljoin(crisil_url, tag['href']) for tag in results]
print(result_links)
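To answer the "today's date" part of the question, the date fields can be filled in from a date object rather than typed by hand. This is only a minimal sketch: build_date_payload is a hypothetical helper name, and it assumes that selDay/selMonth/selYear are the "from" fields and selDay1/selMonth1/selYear1 are the "to" fields, which is what the field names in the payload above suggest.

```python
from datetime import date


def build_date_payload(day=None):
    """Return the from/to date fields for a single day (defaults to today).

    Hypothetical helper; the field names are copied from the payload above
    and their from/to meaning is an assumption.
    """
    day = day or date.today()
    return {
        'selDay': day.day,
        'selMonth': day.month,
        'selYear': day.year,
        'selDay1': day.day,
        'selMonth1': day.month,
        'selYear1': day.year,
    }


# Same dates as in the example payload
payload_dates = build_date_payload(date(2014, 12, 31))
print(payload_dates)
```

Calling payload.update(build_date_payload()) just before the requests.post call would then replace the hardcoded dates with the current day.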
You will need to check the ids of the industries you are filtering on, so look them up in the browser via Inspect Element on the industries select box.
After that, you will get the response and do the parsing via BeautifulSoup, as you are doing now.
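Instead of reading the ids by hand in the browser, the option values of a select box can also be listed from the page HTML. The sketch below uses the standard library's HTMLParser rather than BeautifulSoup so that it runs with no third-party dependency. The SelectOptions class is a hypothetical helper, the sample HTML is invented for illustration (only the value 148 comes from the payload above), and it assumes the select box is named after the form field, e.g. selSector.

```python
from html.parser import HTMLParser


class SelectOptions(HTMLParser):
    """Collect [value, label] pairs from the <select> with a given name."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False
        self.current_value = None
        self.options = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'select' and attrs.get('name') == self.select_name:
            self.in_select = True
        elif tag == 'option' and self.in_select:
            self.current_value = attrs.get('value', '')
            self.options.append([self.current_value, ''])

    def handle_endtag(self, tag):
        if tag == 'select':
            self.in_select = False
            self.current_value = None
        elif tag == 'option':
            self.current_value = None

    def handle_data(self, data):
        # Text inside an <option> of the target select is its visible label
        if self.in_select and self.current_value is not None:
            self.options[-1][1] += data.strip()


# Invented sample markup; the real page's option labels will differ
sample_html = """
<select name="selSector">
  <option value="">All sectors</option>
  <option value="148">Sample sector</option>
</select>
<select name="selIndustry">
  <option value="7">Sample industry</option>
</select>
"""

parser = SelectOptions('selSector')
parser.feed(sample_html)
sector_options = [tuple(pair) for pair in parser.options]
print(sector_options)
```

With the real page, sample_html would be replaced by the text of a response, and the value of the option you want goes into the matching payload field.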
Checking periodically: to run this check on a schedule, consider crontab if you are on Linux/Unix or a scheduled task if you are on Windows.
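When the script runs on a schedule, it is usually more useful to report only the links that were not there on earlier runs. One possible way to do that, sketched here under assumptions, is to keep the links already seen in a small JSON file. report_new_links and the seen_links.json file name are hypothetical, and the demo writes to a temporary directory rather than a real location.

```python
import json
import os
import tempfile


def report_new_links(current_links, state_path):
    """Return links not seen on earlier runs and record them in state_path."""
    seen = []
    if os.path.exists(state_path):
        with open(state_path, encoding='utf-8') as handle:
            seen = json.load(handle)
    seen_set = set(seen)
    new_links = []
    for link in current_links:
        if link not in seen_set:
            seen_set.add(link)
            new_links.append(link)
    # Persist the old and new links together for the next scheduled run
    with open(state_path, 'w', encoding='utf-8') as handle:
        json.dump(seen + new_links, handle)
    return new_links


# Demo with made-up links, simulating two consecutive scheduled runs
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, 'seen_links.json')
    first = report_new_links(['http://www.example.com/a'], state)
    second = report_new_links(
        ['http://www.example.com/a', 'http://www.example.com/b'], state)

print(first)
print(second)
```

In the real script, result_links from the parsing step would be passed as current_links, with state_path pointing at a file that survives between runs.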
Upvotes: 2