Anil

Reputation: 1772

Getting a JavaScript variable's value while scraping with Python

I know this has been asked before, but I am new to scraping and Python. Any help would be very useful on my learning path.

I am scraping a news site using Python with packages such as Beautiful Soup.

I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.

Here is the part of the HTML page that I am scraping (only the script part):

<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>

From the above, I want to get the value of min_news_id in Python. I also need the new value of the same variable whenever it is updated at line 2.

Here is how I am doing it:

    self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    # find all the script tags
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)

But I never get the value of the variable. Is the problem with my regular expression or something else? Please help. Thank you in advance.

Upvotes: 3

Views: 13269

Answers (3)

Anil

Reputation: 1772

Thank you for the response. I finally solved it using the requests package after reading its documentation.

Here is my code:

        if InShortsScraper.firstLoad:
            self.pattern = re.compile('var min_news_id = (.+?);')
        else:
            self.pattern = re.compile('min_news_id = (.+?);')
        page = None
        # print("Pattern: " + str(self.pattern))
        if news_offset is None:
            # First load: fetch the full HTML page.
            htmlPage = urlopen(url)
            page = bs(htmlPage, "html.parser")
        else:
            # Later loads: POST the current offset to the AJAX endpoint.
            self.loadMore['news_offset'] = InShortsScraper.newsOffset
            # print("payload : " + str(self.loadMore))
            try:
                r = myRequest.post(
                    url = url,
                    data = self.loadMore
                )
            except TypeError:
                print("Error in loading")
            else:
                # Only read the response if the request actually succeeded.
                data = r.json()
                InShortsScraper.newsOffset = data["min_news_id"]
                page = bs(data["html"], "html.parser")
        #print(page)
        if InShortsScraper.newsOffset is None:
            # No offset yet: pull it out of the inline script on the page.
            scripts = page.find_all("script")
            for script in scripts:
                for line in script:
                    scriptString = str(line)
                    if "min_news_id" in scriptString:
                        finder = re.findall(self.pattern, scriptString)
                        InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
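
A likely reason the code in the question never printed "matched" is that pattern.match only succeeds at the very start of the string, and the script text starts with whitespace. (Also, str.replace returns a new string rather than changing scriptString in place.) Here is a minimal sketch of the difference, using a shortened copy of the script body from the question; the name script_text is only illustrative:

```python
import json
import re

# Shortened copy of the inline script from the question.
script_text = '''
    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      min_news_id = data.min_news_id||min_news_id; // line 2
    }
'''

pattern = re.compile(r'var min_news_id = (.+?);')

# match() anchors at position 0, where the text begins with a newline.
anchored = pattern.match(script_text)

# search() scans the whole string and finds the declaration.
found = pattern.search(script_text)

# The captured group is a double-quoted JS string literal, which is also valid JSON.
value = json.loads(found.group(1))

print(anchored)  # None
print(value)     # d7zlgjdu-1
```

re.findall, as used in the code above, scans the whole string the same way search does, which is why it finds the value.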

Upvotes: 0

ewwink

Reputation: 19184

You can't monitor a JavaScript variable change using BeautifulSoup. Here is how to get the next page of news using a while loop, re, and the JSON returned by the AJAX endpoint:

from bs4 import BeautifulSoup
import requests, re

page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'

htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...

# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1) # result: d7zlgjdu-1

customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....

    # new min_news_id
    min_news_id = ajax_response["min_news_id"]

    # remove this break to loop over all pages (possibly thousands)
    break
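
If you remove the break, the loop needs its own stopping rule. The page's JavaScript keeps the old value when the response has no min_news_id (data.min_news_id||min_news_id), so an offset that is missing or unchanged is a reasonable signal to stop. The sketch below assumes that behaviour. The names iter_pages, post_json, max_pages and the fake responses are hypothetical, and the network call is replaced by a stand-in so the logic can be run offline:

```python
def iter_pages(first_offset, post_json, max_pages=5):
    """Yield (offset, html) pairs, following min_news_id until it stops changing."""
    offset = first_offset
    for _ in range(max_pages):
        # post_json is any callable that sends the body and returns the parsed JSON.
        data = post_json({'category': 'politics', 'news_offset': offset})
        yield offset, data.get('html', '')
        next_offset = data.get('min_news_id')
        # Stop when the server gives no new offset, or repeats the current one.
        if not next_offset or next_offset == offset:
            break
        offset = next_offset


# Stand-in for the real endpoint: maps a requested offset to a canned response.
fake_responses = {
    'a-1': {'min_news_id': 'b-1', 'html': '<div>page 2</div>'},
    'b-1': {'min_news_id': None, 'html': '<div>page 3</div>'},
}


def fake_post(body):
    return fake_responses[body['news_offset']]


pages = list(iter_pages('a-1', fake_post))
print(pages)
```

With the real site, post_json could wrap the requests.post(...).json() call shown above.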

Upvotes: 1

Kamikaze_goldfish

Reputation: 861

import re

html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>'''

finder = re.findall(r'min_news_id = .*;', html)
print(finder)

Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']

#2 Or, to extract just the value from the first match, you can use:

print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

Output:
d7zlgjdu-1

#3 Or, to match the ID format directly (eight lowercase letters or digits, a hyphen, one digit), you can use:

finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)   

Output:
['d7zlgjdu-1'] 
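
The first pattern above also matches the reassignment on line 2, and the third depends on the ID keeping that exact shape. Another option is to anchor on the var keyword and capture only the quoted literal. This is a rough sketch; the helper name extract_js_string_var is made up for illustration, and it assumes the value is a double-quoted string with no escaped quotes inside:

```python
import re


def extract_js_string_var(html, name):
    """Return the string literal assigned in `var <name> = "...";`, or None."""
    pattern = re.compile(r'var\s+' + re.escape(name) + r'\s*=\s*"([^"]*)"')
    match = pattern.search(html)
    return match.group(1) if match else None


sample = '''
  <script type="text/javascript">
    var min_news_id = "d7zlgjdu-1"; // line 1
    min_news_id = data.min_news_id||min_news_id; // line 2
  </script>
'''

print(extract_js_string_var(sample, 'min_news_id'))  # d7zlgjdu-1
print(extract_js_string_var(sample, 'missing_var'))  # None
```

Returning None for a missing variable lets the caller decide what to do instead of hitting an IndexError on finder[0].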

Upvotes: 2
