GaëtanLF

Reputation: 101

BeautifulSoup doesn't get the full HTML code

I'm having some trouble with this code, where I try to get all the Pokémon names from pokedex.org. My original code is the following:

import requests
from bs4 import BeautifulSoup

url = 'https://pokedex.org/'
html = BeautifulSoup(requests.get(url).content,'lxml')

uls = html.find('ul', attrs = {'id':'monsters-list'})

print(uls.prettify())

Then uls should contain <li></li> elements which themselves contain <span></span> elements where the name is wrapped. It works fine for exactly the first 100 Pokémon, but it returns empty <li></li> elements for the other 500. I've tried different parsers such as html.parser, html5lib and lxml, but it doesn't change anything.
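
A quick way to confirm that the missing names simply aren't in the HTML that requests receives is to count how many list items actually contain a name (a minimal check, reusing the same url and parser as above):

import requests
from bs4 import BeautifulSoup

url = 'https://pokedex.org/'
html = BeautifulSoup(requests.get(url).content, 'lxml')
uls = html.find('ul', attrs={'id': 'monsters-list'})

# count <li> entries whose <span> actually holds a name
named = [li for li in uls.find_all('li')
         if li.find('span') and li.find('span').get_text(strip=True)]
print(len(uls.find_all('li')), 'list items,', len(named), 'with a name')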

Upvotes: 1

Views: 3133

Answers (2)

KailasMM

Reputation: 106

It looks like the element is being created by JavaScript, and requests can't handle elements that are generated dynamically by JavaScript (correct me if I'm wrong).

I suggest using Selenium together with ChromeDriver to get the page source; then you can use BeautifulSoup for parsing.

(Assuming you use the Chrome browser)

  1. Visit chrome://settings/help and check your Chrome version.
  2. Download the matching version of ChromeDriver from the official website (https://chromedriver.chromium.org/downloads).
  3. Place chromedriver.exe and your Python file in the same directory.

Finally, we get to the code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# headless background execution
options = Options()
options.headless = True

url = "https://pokedex.org/"
browser = webdriver.Chrome(options=options)
browser.get(url)

# parse the page source rendered by the browser, not a fresh requests download
html = BeautifulSoup(browser.page_source, 'lxml')
uls = html.find('ul', attrs={'id': 'monsters-list'})

print(uls.prettify())
browser.quit()

Upvotes: 1

Samsul Islam

Reputation: 2609

The page is loaded dynamically, so requests won't work on its own. We can use Selenium instead to scrape the page, and we also need to scroll the page down.

Install it with: pip install selenium.

Download the correct ChromeDriver from here. Here is the code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

url = 'https://pokedex.org/'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)

# scroll to the bottom so the remaining list items get rendered
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html = BeautifulSoup(driver.page_source, 'lxml')

uls = html.find('ul', attrs={'id': 'monsters-list'})

print(uls.prettify())

Output (last item):

<li style="background: linear-gradient(90deg, #B8B8D0 50%, #A8B820 50%)">
  <button class="monster-sprite sprite-649" type="button">
  </button>
  <span>
   Genesect
  </span>
 </li>
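
If you only need the names themselves, you can collect the text of the <span> inside each <li> afterwards (a small follow-up sketch reusing the uls variable from the code above):

# pull the Pokémon names out of the parsed list
names = [span.get_text(strip=True) for span in uls.find_all('span')]
print(len(names), names[:5])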

Upvotes: 2
