Asir Shahriar Roudra

Reputation: 360

Why can't I scrape all data from ecommerce websites?

I'm working on a project where I have to scrape data from e-commerce websites, but I can't access the data I want. For example, when I try to scrape all the list items from https://evaly.com.bd/search-results?query=remax%20610d, the only output I get is <li class="ais-InfiniteHits-sentinel"></li>. Also, when I print the page's HTML with print(soup.prettify()), the output does not contain the full markup. Here is my code for getting all the list items:

from bs4 import BeautifulSoup
import requests
link = "https://evaly.com.bd/search-results?query=remax%20610"

source = requests.get(link).text

soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())
li = soup.find_all("li")
print(li)

And here is the output when I run print(soup.prettify()) :

<!DOCTYPE html>
<html>
 <head>
  <style data-styled="" data-styled-version="5.2.0">
   .lfkzsQ{background-color:white;-webkit-letter-spacing:0.025em;-moz-letter-spacing:0.025em;-ms-letter-spacing:0.025em;letter-spacing:0.025em;font-weight:500;font-size:15px;height:46px;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-flex:1;-ms-flex:1;flex:1;padding:0 17px;border:1px solid var(--primary);border-radius:6px 0 0 6px;outline:none;}/*!sc*/
@media (max-width:425px){.lfkzsQ{width:50%;min-width:50%;}}/*!sc*/
data-styled.g87[id="Searchbar__SeachInput-xnx3kr-0"]{content:"lfkzsQ,"}/*!sc*/
.jtCmJd{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-flex-direction:row;-ms-flex-direction:row;flex-direction:row;width:100%;height:100%;border-radius:5px;overflow:hidden;background-color:#f6f6f6;}/*!sc*/
data-styled.g88[id="Searchbar__Container-xnx3kr-1"]{content:"jtCmJd,"}/*!sc*/
.BVXNH{cursor:pointer;display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;padding-right:29px;padding-left:29px;background:var(--primary);color:#fff;}/*!sc*/
@media (max-width:425px){.BVXNH{padding-right:5px;padding-left:5px;}}/*!sc*/
data-styled.g90[id="Searchbar__Button-xnx3kr-3"]{content:"BVXNH,"}/*!sc*/
.XBQPS{font-size:25px;}/*!sc*/
@media (max-width:768px){.XBQPS{font-size:20px;}}/*!sc*/
data-styled.g92[id="Searchbar___StyledMdSearch-xnx3kr-5"]{content:"XBQPS,"}/*!sc*/
.jCIuWZ{display:grid;grid-template-columns:repeat(auto-fill,minmax(200px,1fr));grid-gap:1vw;}/*!sc*/
@media (max-width:768px){.jCIuWZ{grid-template-columns:repeat(auto-fill,minmax(150px,1fr));grid-gap:1vw;}}/*!sc*/
data-styled.g246[id="algoliaConnectComponent__GridP-sc-1c85asy-0"]{content:"jCIuWZ,"}/*!sc*/
.jmbKPm{width:100%;max-width:100px;min-width:0;height:32px;padding:0 16px;-webkit-appearance:none;-moz-appearance:none;appearance:none;background-color:#f5f5fa;font-size:12px;border-radius:4px;}/*!sc*/
data-styled.g247[id="algoliaConnectComponent___StyledInput-sc-1c85asy-1"]{content:"jmbKPm,"}/*!sc*/
.eZHEjD{width:100%;max-width:100px;min-width:0;height:32px;padding:0 16px;-webkit-appearance:none;-moz-appearance:none;appearance:none;background-color:#f5f5fa;font-size:12px;color:#5d6494;border-radius:4px;}/*!sc*/
data-styled.g248[id="algoliaConnectComponent___StyledInput2-sc-1c85asy-2"]{content:"eZHEjD,"}/*!sc*/
.gqxLmc{display:block;height:32px;margin-left:8px;padding-left:16px;padding-right:16px;background:linear-gradient(90deg,#f5515f 0%,#9f041b 100%);color:#fff;border-radius:4px;box-shadow:0 4px 11px 0 rgba(37,44,97,0.15),0 2px 3px 0 rgba(93,100,148,0.2);-webkit-transition:all 0.2s ease-out;transition:all 0.2s ease-out;}/*!sc*/
data-styled.g249[id="algoliaConnectComponent___StyledButton-sc-1c85asy-3"]{content:"gqxLmc,"}/*!sc*/
.gWgnak{display:grid;grid-template-columns:6% 10% auto 25%;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;grid-template-areas:"logo menu search notification";}/*!sc*/
@media (max-width:768px){.gWgnak{grid-template-columns:25% 25% 25% 25%;grid-template-areas:"menu logo logo user" "notification notification notification notification" "search search search search";}.gWgnak .logo{justify-self:center;margin-bottom:1rem;max-width:76px;width:100%;}.gWgnak .menu{position:relative;justify-self:left;}}/*!sc*/
data-styled.g253[id="search-results__GridContainer-sc-6ln6mm-1"]{content:"gWgnak,"}/*!sc*/
.jpeNuX{min-height:3rem;}/*!sc*/
data-styled.g254[id="search-results___StyledDiv-sc-6ln6mm-2"]{content:"jpeNuX,"}/*!sc*/
.ejWvfj{right:30px;bottom:30px;background:linear-gradient(90deg,#f5515f 0%,#9f041b 100%);}/*!sc*/
@media (max-width:767px){.ejWvfj{bottom:75px;}}/*!sc*/
data-styled.g255[id="search-results___StyledButton-sc-6ln6mm-3"]{content:"ejWvfj,"}/*!sc*/
  </style>
  <link href="/static/manifest.json" rel="manifest"/>
  <title>
   E-valy Limited | Online Shopping Mall
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, shrink-to-fit=no, maximum-scale=1.0, user-scalable=no" name="viewport"/>
  <meta content="E-valy Limited | Online Shopping Mall" property="og:title"/>
  <meta content="article" property="og:type"/>
  <meta content="https://s3-ap-southeast-1.amazonaws.com/media.evaly.com.bd/media/2019-08-04_090235.843922android-icon-200x200.png" property="og:image"/>
  <meta content="450" property="og:image:width"/>
  <meta content="298" property="og:image:height"/>
  <meta content="https://evaly.com.bd" property="og:url"/>
  <meta content="E-valy is an e-commerce site which will be capable of providing every kind of goods and products from every sector to every consumer located in Bangladesh." property="og:description"/>
  <link href="/static/images/icons/favicon.ico" rel="shortcut icon"/>
  <meta content="evaly://" property="al:android:url"/>
  <meta content="Evaly" property="al:android:app_name"/>
  <meta content="bd.com.evaly.evalymarchant" property="al:android:package"/>
  <meta content="14" name="next-head-count"/>
  <link as="style" href="/_next/static/css/d48fe9f040f8d2f97c7e.css" rel="preload"/>
  <link href="/_next/static/css/d48fe9f040f8d2f97c7e.css" rel="stylesheet"/>
  <link as="script" href="/_next/static/RZ7VftogY8QkgPiLg6BPz/pages/_app.js" rel="preload"/>
  <link as="script" href="/_next/static/RZ7VftogY8QkgPiLg6BPz/pages/search-results.js" rel="preload"/>
  <link as="script" href="/_next/static/runtime/webpack-6b3d3cda09a7b5b5debf.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/framework.7dfd02d307191d63a37e.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/b637e9a5.a705a21716e5b01f8145.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/0c9dcbbe.7fbd830a3d684b32423b.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/commons.afffbbb0420dd9af938a.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/6a597b002e9daab94e2e0adeb626acca4f1f6515.28c9d68d9749974f08e1.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/bba5516912876db85383b691379c4486ab998795.071cf6d38264238f2f49.js" rel="preload"/>
  <link as="script" href="/_next/static/runtime/main-3c89e50e2c7d7034f938.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/252f366e.32bec51017e26b1dae31.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/95b64a6e.a74dcc7937bf0c356811.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/d7eeaac4.afdce0938beabe8eef9a.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/2dc48ec14d05924f473dce007726385374c258b9.0a52afc0ae53472a590f.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/3ad14741d7bfb55e1bcea5bfc6670f090f0855af.b5af8ef4be1abd2d5791.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/f6d549f16f3909adbb4f9a302aacab15937bfbda.94c734c42c1caf61b869.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/a9dd91d4607a584382b3e8a70a910ee9fb417c65.cabb84905704185ea6f6.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/4cbc61372435748121077b3b94e57617b6c8338d.5ae2119035f5c9d8c81c.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/411365f484ca502253106aae57d21ae3bb416d15.2f90a1a0cb46996155b4.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/69ef8573555555a232f56c2d2a1de6a4101c15d0.d8f92afd6f8ceb35f607.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/5d7bf10f24bff82d5530a050de689a7c020a359b.36ce757546da64e3337c.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/c8a8012dbcfaeb41f17a667b3a927ba45766e4a2.312913bb8463128a068e.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/c1f80152d80b1129cab9e73f90501b8957be40a7.04f2303ad32c2682fab1.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/8d4460396e9219a79f33af22e0a8f4fe429b291e.cda426e58b75b281586e.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/57f045ed70322177467d785413f62aff844e25d2.ad35b737612878a9f01a.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/0378a7d7ac3f1a3f5f0e99380b068fe3a41b14e6.46f0a10d89a7db3593b1.js" rel="preload"/>
  <link as="script" href="/_next/static/chunks/680dd3e5bbe68ece4bf42804461f8830da8bd4e0.d71300269070cc46823a.js" rel="preload"/>
 </head>
 <body>
  <div id="__next">
   <div class="jsx-2334610719 min-h-screen pb-2" style="background-color:#F7F8FA">
    <div class="ais-InstantSearch__root">
     <div class="topbar bg-gray-100 py-1 text-gray-600 hidden md:block">
      <div class="container flex justify-between text-sm">
       <div class="flex">
        <div class="mr-4">
         <a href="https://merchant.evaly.com.bd/">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#shop" xlink:href="/static/images/icons.svg#shop">
           </use>
          </svg>
          Merchant zone
         </a>
        </div>
        <div class="mr-4">
         <a href="/feeds">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#newsfeed" xlink:href="/static/images/icons.svg#newsfeed">
           </use>
          </svg>
          News Feed
         </a>
        </div>
        <div class="mr-4">
         <a href="https://play.google.com/store/apps/details?id=bd.com.evaly.evalyshop">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#mobile" xlink:href="/static/images/icons.svg#mobile">
           </use>
          </svg>
          Download App
         </a>
        </div>
       </div>
       <div class="flex">
        <div class="mr-4">
         <a href="https://www.facebook.com/groups/EvalyHelpDesk/">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#help" xlink:href="/static/images/icons.svg#help">
           </use>
          </svg>
          <!-- -->
          Help
         </a>
        </div>
        <div>
         <a href="https://www.facebook.com/evaly.com.bd/">
          <svg class="w-3 h-3 mr-1 inline align-baseline">
           <use href="/static/images/icons.svg#facebook" xlink:href="/static/images/icons.svg#facebook">
           </use>
          </svg>
          <!-- -->
          Follow us
         </a>
        </div>
       </div>
      </div>
     </div>
     <div class="bg-white header" style="box-shadow:0 4px 16px 0 rgba(0,0,0,0.04)">
      <div class="search-results__Container-sc-6ln6mm-0 hFUCjp container py-5 px-8">
       <div class="search-results__GridContainer-sc-6ln6mm-1 gWgnak">
        <a class="logo xs:w-1/2" href="/" style="grid-area:logo">
         <img alt="logo" class="" src="/static/images/logo.svg" style="max-width:76px"/>
        </a>
        <button class="text-2xl menu md:block mb-4 md:mb-0" style="grid-area:menu">
         <svg class="m-auto text-gray-700" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg">
          <path d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z">
          </path>
         </svg>
        </button>
        <div class="md:hidden mb-4" style="grid-area:user;justify-self:right">
         <button class="flex items-center">
          <span class="flex w-full items-center text-gray-700">
           <span>
            <svg color="#1D2531" fill="currentColor" height="25" size="25" stroke="currentColor" stroke-width="0" style="color:#1D2531" viewbox="0 0 1024 1024" width="25" xmlns="http://www.w3.org/2000/svg">
             <path d="M858.5 763.6a374 374 0 0 0-80.6-119.5 375.63 375.63 0 0 0-119.5-80.6c-.4-.2-.8-.3-1.2-.5C719.5 518 760 444.7 760 362c0-137-111-248-248-248S264 225 264 362c0 82.7 40.5 156 102.8 201.1-.4.2-.8.3-1.2.5-44.8 18.9-85 46-119.5 80.6a375.63 375.63 0 0 0-80.6 119.5A371.7 371.7 0 0 0 136 901.8a8 8 0 0 0 8 8.2h60c4.4 0 7.9-3.5 8-7.8 2-77.2 33-149.5 87.8-204.3 56.7-56.7 132-87.9 212.2-87.9s155.5 31.2 212.2 87.9C779 752.7 810 825 812 902.2c.1 4.4 3.6 7.8 8 7.8h60a8 8 0 0 0 8-8.2c-1-47.8-10.9-94.3-29.5-138.2zM512 534c-45.9 0-89.1-17.9-121.6-50.4S340 407.9 340 362c0-45.9 17.9-89.1 50.4-121.6S466.1 190 512 190s89.1 17.9 121.6 50.4S684 316.1 684 362c0 45.9-17.9 89.1-50.4 121.6S557.9 534 512 534z">
             </path>
            </svg>
           </span>
          </span>
         </button>
        </div>
        <div style="grid-area:search">
         <form action="" novalidate="" role="search">
          <div class="Searchbar__Container-xnx3kr-1 jtCmJd">
           <input class="Searchbar__SeachInput-xnx3kr-0 lfkzsQ" placeholder="Search..." type="search" value="remax 610"/>
           <figure class="Searchbar__Button-xnx3kr-3 BVXNH" color="black">
            <svg _css2="
    @media (max-width: ,768px,) {
      ,
            font-size:20px;
          ,
    }
  " class="Searchbar___StyledMdSearch-xnx3kr-5 XBQPS" color="white" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" style="color:white" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg">
             <path d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3 9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 
4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z">
             </path>
            </svg>
           </figure>
          </div>
         </form>
        </div>
        <div class="md:pl-4 notification hidden md:block" style="grid-area:notification">
         <div class="flex justify-between items-center mb-4 mx-16 md:mx-0 md:mb-0 lg:ml-8">
          <button class="text-2xl menu md:hidden">
           <svg class="m-auto" fill="currentColor" height="1em" stroke="currentColor" stroke-width="0" viewbox="0 0 24 24" width="1em" xmlns="http://www.w3.org/2000/svg">
            <path d="M3 18h18v-2H3v2zm0-5h18v-2H3v2zm0-7v2h18V6H3z">
            </path>
           </svg>
          </button>
          <button class="relative">
           <svg color="#1D2531" fill="currentColor" height="25" size="25" stroke="currentColor" stroke-width="0" style="color:#1D2531" view

How can I solve these problems? EDIT: Using Selenium and ChromeDriver would be too time-consuming for my project.

Upvotes: 0

Views: 1163

Answers (1)

Vin

Reputation: 986

Try the approach below, which uses requests and json. I built the script around the API URL, which I found by inspecting the network calls that Chrome makes when the page loads. The script then creates the form data dynamically so that it can step through every page of results.
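To see why parsing the page itself only yields the sentinel element, it can help to look at what the server-sent HTML contains before any JavaScript runs. The following is a minimal sketch using only the standard library; STATIC_HTML is a hand-written stand-in modeled on the output in the question, not the real response, and LiCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser


class LiCollector(HTMLParser):
    """Collect the class attribute of every <li> start tag."""

    def __init__(self):
        super().__init__()
        self.li_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_classes.append(dict(attrs).get("class"))


# Assumption: a simplified stand-in for the HTML the server sends.
# The product entries are missing because the browser adds them later
# by calling the search API from JavaScript.
STATIC_HTML = """
<div id="__next">
  <ul>
    <li class="ais-InfiniteHits-sentinel"></li>
  </ul>
</div>
"""

collector = LiCollector()
collector.feed(STATIC_HTML)
print(collector.li_classes)
```

Only the placeholder shows up, which suggests the product data has to come from a separate request rather than from the page markup.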

What the script does:

  1. First, the script creates the form data for the API call. The page number, the query string and the max values per facet (Algolia's maxValuesPerFacet parameter) are dynamic, and page_no is incremented by 1 after each page has been processed.

  2. requests sends the form data to the URL with the POST method, and the response is then parsed as JSON.

  3. From the parsed data, the script navigates to the part of the JSON object where the product data actually lives.

  4. Finally, it loops over the batch of results on each page, one page at a time, and prints them.

Right now the script displays only some of the information; you can read more fields from the JSON object in the same way I have done below.
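Step 1 can be sketched in isolation. This is only an illustration: build_form_data is a hypothetical helper name, the index name "products" is taken from the script in this answer, and the highlight and facet parameters used there are left out for brevity. It uses urllib.parse.urlencode so the query text can be passed unencoded.

```python
from urllib.parse import quote, urlencode


def build_form_data(query, page_no, max_values_per_facet=10, index_name="products"):
    """Build the JSON body for one page of search results (hypothetical helper)."""
    params = urlencode(
        {
            "query": query,  # raw text such as "remax 610"; it is encoded here
            "maxValuesPerFacet": max_values_per_facet,
            "page": page_no,
        },
        quote_via=quote,  # encode spaces as %20 rather than +
    )
    return {"requests": [{"indexName": index_name, "params": params}]}


first_page = build_form_data("remax 610", page_no=0)
print(first_page["requests"][0]["params"])
```

Building the string this way avoids having to pre-encode the query by hand when you switch to another search term.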

import json
import requests
from urllib3.exceptions import InsecureRequestWarning

# verify=False is used below, so silence the certificate warning it triggers
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_evaly_data():
    QUERY = 'remax%20610'  # query string; change it to fetch data for another product
    MAX_VALUES_PER_FACET = 10  # sent as maxValuesPerFacet (limits the facet values returned)
    page_no = 0  # default page no.; Algolia pages start at 0
    URL = 'https://eza2j926q5-3.algolianet.com/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%20(lite)%3B%20react%20(16.13.1)%3B%20react-instantsearch%20(5.7.0)%3B%20JS%20Helper%20(2.28.1)&x-algolia-application-id=EZA2J926Q5&x-algolia-api-key=ca9abeea06c16b7d531694d6783a8f04'  # API URL for querying

    while True:
        print('Hold on creating new form data...')
        # Form data is rebuilt on every iteration so that each request asks for the next page
        form_data = {
            "requests": [{"indexName": "products", "params": "query=" + QUERY + "&maxValuesPerFacet=" + str(MAX_VALUES_PER_FACET) + "&page=" + str(page_no) + "&highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&facets=%5B%22price%22%2C%22category_name%22%2C%22brand_name%22%2C%22shop_name%22%2C%22color%22%5D&tagFilters="}]
        }
        response = requests.post(URL, json=form_data, verify=False)  # request the data using POST with a JSON body
        print('Created new form data going to fetch data...')

        result = json.loads(response.text)  # parse the JSON response
        hits = result['results'][0]['hits']  # product information for this page
        if len(hits) == 0:  # a page with no hits means there is nothing left, so leave the loop
            break

        for item in hits:  # loop over the product information JSON objects
            print('-' * 100)
            print('Brand Name: ', item['brand_name'])
            print('Category Name: ', item['category_name'])
            print('Discount Price: ', item['discounted_price'])
            print('Max Price: ', item['max_price'])
            print('Min Price: ', item['min_price'])
            print('Product Name: ', item['name'])
            print('Product Image: ', item['product_image'])
            print('Shop Item ID: ', item['shop_item_id'])
            print('Shop Name: ', item['shop_name'])
            print('Slug Info: ', item['slug'])
            print('-' * 100)

        page_no += 1  # increment the page number by 1 after each page


scrap_evaly_data()
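If you want to keep the results instead of printing them, the parsed JSON can be turned into a list of dictionaries. The sketch below is an assumption-laden illustration rather than part of the script above: extract_rows and has_next_page are hypothetical helpers, the sample dictionary is hand-written (not real API output), and the page check assumes the response carries Algolia's nbPages field.

```python
FIELDS = ("name", "brand_name", "shop_name", "min_price", "max_price", "slug")


def extract_rows(result, fields=FIELDS):
    """Return one dict per hit; fields missing from a hit become None."""
    block = result["results"][0]
    return [{field: hit.get(field) for field in fields} for hit in block["hits"]]


def has_next_page(result, page_no):
    """True while page_no + 1 is still below the reported page count."""
    block = result["results"][0]
    return page_no + 1 < block.get("nbPages", 0)


# Hand-written sample with the same shape as the parsed response (assumption)
sample = {
    "results": [
        {
            "hits": [
                {"name": "Sample earphone", "brand_name": "Remax", "slug": "sample-earphone"}
            ],
            "page": 0,
            "nbPages": 1,
        }
    ]
}

rows = extract_rows(sample)
print(rows[0]["name"], rows[0]["min_price"])
print(has_next_page(sample, page_no=0))
```

Using hit.get() instead of hit[...] means a product that lacks one of the fields does not stop the whole run with a KeyError.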

Actual Code

Upvotes: 1
