CookieData

Reputation: 63

How to scrape a javascript website in Python?

I am trying to scrape a website. I have tried two methods, but neither gives me the full page source that I am looking for. I want to scrape the news titles from the URL below.

URL: "https://www.todayonline.com/"

These are the two methods I have tried but failed.

Method 1: Beautiful Soup

import requests
from bs4 import BeautifulSoup

tdy_url = "https://www.todayonline.com/"
page = requests.get(tdy_url).text
soup = BeautifulSoup(page, "html.parser")
soup  # Returns HTML that is mostly JavaScript text
soup.find_all('h3')

# Returns an empty list: []

Method 2: Selenium + BeautifulSoup

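import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

tdy_url = "https://www.todayonline.com/"

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)  # assumes chromedriver can be found (e.g. on PATH)

driver.get(tdy_url)
time.sleep(10)  # give the page's JavaScript time to load content
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
soup.find_all('h3')

# Returns fewer than a quarter of the 'h3' tags found in the original page source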

Please help. I have tried scraping other news websites and it is so much easier. Thank you.

Upvotes: 6

Views: 18100

Answers (4)

demian-wolf

Reputation: 1858

The news data on the website you are trying to scrape is fetched from the server by JavaScript, using XHR (XMLHttpRequest) calls. This happens dynamically, while the page is loading or being scrolled, so the data is not part of the HTML page that the server returns.

In the first example, you get only the page returned by the server: it has no news in it, only the JS that is supposed to fetch the news. Neither requests nor BeautifulSoup can execute JS.

However, you can try to reproduce the requests that fetch the news titles from the server with Python's requests library. Follow these steps:

  1. Open your browser's DevTools (usually F12 or Ctrl+Shift+I) and look at the requests that fetch the news titles from the server. Sometimes this is even easier than web scraping with BeautifulSoup.

  2. Copy the request link (right-click -> Copy -> Copy link), and pass it to requests.get(...).

  3. Call .json() on the response. It returns a dict that is easy to work with. To better understand the structure of the dict, I recommend using pprint instead of plain print. Note that you have to do from pprint import pprint before using it.
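As a rough illustration of the pprint tip, here is a minimal sketch. The payload is made up; it only mimics the "nodes" -> "node" -> "title" nesting that the example in this answer relies on, and the helper name outline is hypothetical. Real responses will likely carry many more fields.

```python
from pprint import pformat, pprint

# Made-up payload that mimics the nesting "nodes" -> list of {"node": {"title": ...}}.
# This is an assumption for illustration, not a captured server response.
sample_payload = {
    "nodes": [
        {"node": {"title": "First made-up headline"}},
        {"node": {"title": "Second made-up headline"}},
    ]
}


def outline(payload, depth=2):
    """Return a pretty-printed view of payload, eliding containers nested deeper than `depth`."""
    return pformat(payload, depth=depth)


print(outline(sample_payload))  # shallow view, handy for large responses
pprint(sample_payload)          # full view
```

Starting with a small depth and increasing it step by step tends to be a quick way to find where the interesting keys live.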

Here is example code that gets the titles of the main news items on the page:

import requests


nodes = requests.get("https://www.todayonline.com/api/v3/news_feed/7")\
        .json()["nodes"]
for node in nodes:
    print(node["node"]["title"])

If you want to scrape a group of news items under a particular heading, change the number after news_feed/ in the request URL. To find the right number, filter the requests by "news_feed" in DevTools and scroll down the news page.
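Changing the feed number could be wrapped up roughly like this. It is only a sketch: the helper names, the timeout value and the defensive .get() lookups are my own additions, and it assumes every feed uses the same "nodes" -> "node" -> "title" layout as feed 7.

```python
import requests

# URL pattern taken from the example above; only the trailing number changes.
BASE_URL = "https://www.todayonline.com/api/v3/news_feed/{feed_id}"


def extract_titles(payload):
    """Collect titles from a dict shaped like {"nodes": [{"node": {"title": ...}}]}."""
    titles = []
    for item in payload.get("nodes", []):
        title = item.get("node", {}).get("title")
        if title:  # skip entries without a title instead of raising KeyError
            titles.append(title)
    return titles


def fetch_titles(feed_id, timeout=10):
    """Download one feed and return its titles (hypothetical helper)."""
    response = requests.get(BASE_URL.format(feed_id=feed_id), timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return extract_titles(response.json())
```

Calling fetch_titles(7) should match the example above; any other feed number has to be looked up in DevTools first, as described.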

Sometimes websites have protection against bots (although the website you are trying to scrape doesn't). In such cases, you might need to take additional steps as well.

Upvotes: 5

Manan Gajjar

Reputation: 95

I suggest a fairly simple approach:

import requests
from bs4 import BeautifulSoup as bs

page = requests.get('https://www.todayonline.com/googlenews.xml').content
soup = bs(page)
news = [i.text for i in soup.find_all('news:title')]

print(news)

Output:

['DBS named world’s best bank by New York-based financial publication',
 'Russia has very serious questions to answer on Navalny - UK',
 "Exclusive: 90% of China's Sinovac employees, families took coronavirus vaccine - CEO",
 'Three militants killed after fatal attack on policeman in Tunisia',
.....]

Also, you can check the XML page for more information if required.
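If you would rather parse the XML without BeautifulSoup, the standard library's ElementTree is one possible alternative (this swaps in a different parser from the one used above). The sketch assumes the file follows the usual Google News sitemap format with its standard news namespace, which I have not verified for this site; the sample document and the function name are made up.

```python
import xml.etree.ElementTree as ET

# Standard Google News sitemap namespace; assumed (not verified) to be what the site uses.
NEWS_NS = {"news": "http://www.google.com/schemas/sitemap-news/0.9"}


def news_titles(xml_data):
    """Return the text of every <news:title> element in a sitemap (str or bytes)."""
    root = ET.fromstring(xml_data)
    return [element.text for element in root.iterfind(".//news:title", NEWS_NS)]


# Made-up sample document, used here instead of a live download.
sample_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/a</loc>
    <news:news><news:title>Made-up headline A</news:title></news:news>
  </url>
  <url>
    <loc>https://example.com/b</loc>
    <news:news><news:title>Made-up headline B</news:title></news:news>
  </url>
</urlset>
"""

print(news_titles(sample_xml))
```

With a real download you would pass the bytes from requests.get(...).content to news_titles.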

P.S. Always check for compliance before scraping any website :)

Upvotes: 2

user13959692

Reputation:

There are different ways of gathering the content of a webpage that contains JavaScript.

  1. Using Selenium with the Firefox web driver
  2. Using a headless browser with PhantomJS
  3. Making an API call using a REST client or the Python requests library

You have to do your research first.
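For the browser-driven options, one likely reason for getting only part of the content is that, as another answer here notes, some data loads only while the page is being scrolled. Below is a minimal sketch of scrolling until the page stops growing. The function name, pause length and round limit are assumptions of mine; the only Selenium call it relies on is the driver's execute_script method.

```python
import time


def scroll_until_stable(driver, pause_seconds=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops changing.

    `driver` is any object with Selenium's execute_script method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the background requests time to finish
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds  # gave up; the page may scroll indefinitely
```

You would call it right after driver.get(url) and before reading driver.page_source. A fixed pause is crude, so treat the numbers as starting points to tune.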

Upvotes: 0

Hryhorii Pavlenko

Reputation: 3910

You can access the data via the API (check out the Network tab in your browser's DevTools).


For example,

import requests
url = "https://www.todayonline.com/api/v3/news_feed/7"
data = requests.get(url).json()

Upvotes: 3
