Reputation: 63
I am trying to scrape a website. I have tried using two methods but both do not provide me with the full website source code that I am looking for. I am trying to scrape the news titles from the website URL provided below.
URL: "https://www.todayonline.com/"
These are the two methods I have tried but failed.
tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page)
soup # Returns me a HTML with javascript text
soup.find_all('h3')
### Returns me empty list []
tdy_url = "https://www.todayonline.com/"
options = Options()
options.headless = True
driver = webdriver.Chrome("chromedriver",options=options)
driver.get(tdy_url)
time.sleep(10)
html = driver.page_source
soup = BeautifulSoup(html)
soup.find_all('h3')
### Returns me only less than 1/4 of the 'h3' tags found in the original page source
Please help. I have tried scraping other news websites and it is so much easier. Thank you.
Upvotes: 6
Views: 18100
Reputation: 1858
The news data on the website you are trying to scrape is fetched from the server using JavaScript (this is called XHR -- XMLHttpRequest). It is happening dynamically, while the page is loading or being scrolled. so this data is not returned inside the page returned by the server.
In the first example, you are getting only the page returned by the server -- without the news, but with JS that is supposed to get them. Neither requests nor BeautifulSoup can execute JS.
However, you can try to reproduce requests that are getting news titles from the server with Python requests. Do the following steps:
Copy the request link (right-click -> Copy -> Copy link), and pass it to requests.get(...)
.
Get .json()
of the request. It will return a dict that is easy to work with. To better understand the structure of the dict, I would recommend to use pprint
instead of simple print. Note you have to do from pprint import pprint
before using it.
Here is an example of the code that gets the titles from the main news on the page:
import requests
nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
.json()["nodes"]
for node in nodes:
print(node["node"]["title"])
If you want to scrape a group of news under caption, you need to change the number after news_feed/
in the request URL (to get it, you just need to filter the requests by "news_feed" in the DevTools and scroll the news page down).
Sometimes web sites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to do these steps as well.
Upvotes: 5
Reputation: 95
I will suggest you the fairly simple approach,
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]
print(news)
output
['DBS named world’s best bank by New York-based financial publication',
'Russia has very serious questions to answer on Navalny - UK',
"Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
'Three militants killed after fatal attack on policeman in Tunisia',
.....]
Also, you can check the XML page for more information if required.
P.S. Always check for the compliance before scraping any website :)
Upvotes: 2
Reputation:
There are different ways of gathering the content of a webpage that contains Javascript.
selenium
with Firefox web driverphantomJS
requests
libraryYou have to do your research first
Upvotes: 0
Reputation: 3910
You can access data via API (check out the Network tab):
For example,
import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()
Upvotes: 3