Reputation: 101
I'm having trouble with this code, which tries to collect every Pokemon name from pokedex.org. My original code is the following:
import requests
from bs4 import BeautifulSoup
url = 'https://pokedex.org/'
html = BeautifulSoup(requests.get(url).content,'lxml')
uls = html.find('ul', attrs = {'id':'monsters-list'})
print(uls.prettify())
Then, uls should contain some <li></li> elements, each of which contains a <span></span> that wraps the name. It works well for exactly the first 100 Pokemon, but after that it returns empty <li></li> elements for the other 500 or so. I've tried different parsers (html.parser, html5lib and lxml), but it doesn't change anything.
Upvotes: 1
Views: 3133
Reputation: 106
It looks like the elements are created by JavaScript, and requests can't handle dynamically generated content because it never runs the page's scripts. (Correct me if I'm wrong.)
I suggest using Selenium together with ChromeDriver to get the rendered page source; you can then parse it with BeautifulSoup.
(This assumes you use the Chrome browser.)
Here is the code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# run Chrome headless (no visible window)
options = Options()
options.add_argument("--headless")
url = "https://pokedex.org/"
browser = webdriver.Chrome(options=options)
browser.get(url)
# parse the source rendered by the browser, after JavaScript has run
html = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()
uls = html.find('ul', attrs={'id': 'monsters-list'})
print(uls.prettify())
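Once you have the rendered page source, the names themselves can be pulled out of the list. Here is a minimal sketch, assuming the markup is what the question describes (a ul with id monsters-list whose li items wrap the name in a span). extract_names is a hypothetical helper written for this answer, not part of BeautifulSoup, and the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup


def extract_names(page_source):
    """Return the names found in the #monsters-list element (hypothetical helper)."""
    soup = BeautifulSoup(page_source, "html.parser")
    monsters = soup.find("ul", attrs={"id": "monsters-list"})
    if monsters is None:
        return []
    names = []
    for item in monsters.find_all("li"):
        span = item.find("span")
        # Items that were not rendered yet are empty <li></li>, so skip them.
        if span is not None and span.get_text(strip=True):
            names.append(span.get_text(strip=True))
    return names


# Made-up sample that mimics the assumed markup, including one empty item.
sample = (
    '<ul id="monsters-list">'
    '<li><button class="monster-sprite sprite-1" type="button"></button>'
    "<span>Bulbasaur</span></li>"
    "<li></li>"
    "<li><span> Genesect </span></li>"
    "</ul>"
)
print(extract_names(sample))
```

With the browser code above you would call it on browser.page_source (before quitting the browser). Comparing the length of the result with the number of li items is a quick way to see whether some entries are still empty.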
Upvotes: 1
Reputation: 2609
The page is loaded dynamically, so requests alone can't retrieve the full list. We can use Selenium instead to render the page, and we also need to scroll down so that the remaining items load.
Install it with: pip install selenium
Download the ChromeDriver that matches your Chrome version. Here is the code:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = 'https://pokedex.org/'
driver = webdriver.Chrome()  # named driver so the webdriver module is not shadowed
driver.get(url)
time.sleep(2)  # wait for the initial render
# scroll to the bottom so the remaining items are loaded
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the newly loaded items
html = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
uls = html.find('ul', attrs = {'id':'monsters-list'})
print(uls.prettify())
Last item of the output:
<li style="background: linear-gradient(90deg, #B8B8D0 50%, #A8B820 50%)">
<button class="monster-sprite sprite-649" type="button">
</button>
<span>
Genesect
</span>
</li>
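The fixed sleeps may be too short on a slow connection and needlessly long on a fast one. One alternative is to keep scrolling until the number of rendered names stops growing. This is only a sketch: scroll_until_stable and wait_for_all_monsters are hypothetical helpers, and the CSS selector assumes the markup shown in the output above. If the page takes longer than pause seconds to render new items, the loop may stop early, so treat the defaults as guesses.

```python
import time


def scroll_until_stable(count_items, scroll_once, pause=1.0, max_rounds=30):
    """Scroll until the item count stops growing; return the final count."""
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to render new items
        current = count_items()
        if current == previous:
            break  # nothing new appeared after the last scroll
        previous = current
    return previous


def wait_for_all_monsters(driver, pause=1.0):
    """Wire the helper to a Selenium driver (selector is an assumption)."""
    # Imported here so the generic helper above works without Selenium.
    from selenium.webdriver.common.by import By

    def count_items():
        # Count only items whose <span> has been rendered.
        return len(driver.find_elements(By.CSS_SELECTOR, "#monsters-list li span"))

    def scroll_once():
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    return scroll_until_stable(count_items, scroll_once, pause=pause)
```

In the script above, wait_for_all_monsters(driver) would take the place of the scroll and the two sleeps, called right after driver.get(url) and before reading driver.page_source.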
Upvotes: 2