A. Sam
A. Sam

Reputation: 893

Scraping AJAX loaded content with python?

So i have function that is called when i click a button , it goes as below

var min_news_id = "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1";
function loadMoreNews(){
  $("#load-more-btn").hide();
  $("#load-more-gif").show();
  $.post("/en/ajax/more_news",{'category':'','news_offset':min_news_id},function(data){
      data = JSON.parse(data);
      min_news_id = data.min_news_id||min_news_id;
      $(".card-stack").append(data.html);
  })
  .fail(function(){alert("Error : unable to load more news");})
  .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();

Now i don't have much experience with javascript , but i assume its returning some json data from some sort of api at "en/ajax/more_news" .

Is there i way could directly call this api and get the json data from my python script. If Yes,how?

If not how do i scrape the content that is being generated?

Upvotes: 2

Views: 4926

Answers (2)

Salman Mohammad
Salman Mohammad

Reputation: 182

Here is the script that will automatically loop through all the pages in inshort.com

from bs4 import BeautifulSoup
from newspaper import Article
import requests
import sys
import re
import json

patt = re.compile('var min_news_id\s+=\s+"(.*?)"')
i = 0
while(1):
    with requests.Session() as s:
        if(i==0):soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content,"lxml")
  new_id_scr = soup.find("script", text=re.compile("var\s+min_news_id"))
   news_id = patt.search(new_id_scr.text).group(1)

    js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset":news_id})
    jsn = json.dumps(js.json())
    jsonToPython = json.loads(jsn)
    news_id = jsonToPython["min_news_id"]
    data = jsonToPython["html"]
    i += 1
    soup = BeautifulSoup(data, "lxml")
    for tag in soup.find_all("div", {"class":"news-card"}):
        main_text = tag.find("div", {"itemprop":"articleBody"})
        summ_text = main_text.text
        summ_text = summ_text.replace("\n", " ")
        result = tag.find("a", {"class":"source"})
        art_url = result.get('href') 
        if 'www.youtube.com' in art_url:
            print("Nothing")
        else:
            art_url = art_url[:-1]
            #print("Hello", art_url)
            article = Article(art_url)
            article.download()
            if article.is_downloaded:
                article.parse()
                article_text = article.text
                article_text = article_text.replace("\n", " ")

                print(article_text+"\n")
                print(summ_text+"\n")        

It gives both the summary from inshort.com and complete news from respective news channel.

Upvotes: 0

Padraic Cunningham
Padraic Cunningham

Reputation: 180391

You need to post the news id that you see inside the script to https://www.inshorts.com/en/ajax/more_news, this is an example using requests:

from bs4 import BeautifulSoup
import requests
import re

# pattern to extract min_news_id
patt = re.compile('var min_news_id\s+=\s+"(.*?)"')

with requests.Session() as s:
    soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content)
    new_id_scr = soup.find("script", text=re.compile("var\s+min_news_id"))
    print(new_id_scr.text)
    news_id = patt.search(new_id_scr.text).group()
    js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset":news_id})
    print(js.json())

js gives you all the html, you just have to access the js["html"].

Upvotes: 1

Related Questions