Reputation: 51
I'm trying to scrape news from https://hk.appledaily.com/search/apple.
I need to get the news content from each div with class="flex-feature", but my code only returns []. I hope someone can help, thank you!
from bs4 import BeautifulSoup
import requests

page = requests.get("https://hk.appledaily.com/search/apple")
soup = BeautifulSoup(page.content, 'lxml')
results = soup.find_all('div', class_="flex-feature")
print(results)  # prints []
Upvotes: 2
Views: 289
Reputation: 7558
The data on that page is fetched and rendered dynamically by JavaScript, so it isn't present in the HTML that requests receives. You won't be able to get the data unless you execute the JavaScript.
One approach to scrape the data would be to use a headless browser.
Here is one such example using pyppeteer.
import asyncio
from pyppeteer import launch
# https://pypi.org/project/pyppeteer/

URL = 'https://hk.appledaily.com/search/apple'

async def main():
    # Start a headless Chromium instance and open a new tab
    browser = await launch()
    page = await browser.newPage()
    await page.goto(URL)
    # Wait until JavaScript has rendered at least one matching element
    await page.waitForSelector(".flex-feature")
    elements = await page.querySelectorAll('.flex-feature')
    for el in elements:
        # Read each element's text content inside the browser context
        text = await page.evaluate('(el) => el.textContent', el)
        print(text)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
output:
3小時前特朗普確診 不斷更新 特朗普新聞秘書及多名白宮職員確診 「白宮群組」持續擴大特朗普確診 不斷更新
... TRUNCATED ...
Upvotes: 1
Reputation: 550
If you use View page source in your browser, you'll see that flex-feature is nowhere in the HTML. That source is what the server initially sends, before any JavaScript runs and fills in the dynamic content. It is also the same HTML that requests.get returns, which is why find_all gives you [].
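One way to confirm this without opening a browser is to check whether the class name appears anywhere in the raw HTML. The sketch below uses only the standard library. Note that `SAMPLE_HTML` is a made-up stand-in for the kind of near-empty shell a JavaScript-rendered page typically serves, not the actual response from the site, and `ClassFinder` and `static_html_has_class` are hypothetical helper names chosen for illustration.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.matches += 1


def static_html_has_class(html, class_name):
    """Return True if any tag in the given HTML string carries class_name."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.matches > 0


# Hypothetical example of the shell a JS-rendered page might send (assumption).
SAMPLE_HTML = """
<html>
  <body>
    <div id="root"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>
"""

print(static_html_has_class(SAMPLE_HTML, "flex-feature"))
```

With a real response you would pass `requests.get(url).text` instead of `SAMPLE_HTML`. A `False` result suggests the element is added later by JavaScript, so a plain HTTP request will not be enough.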
To access these elements, you'll likely want to use something like Selenium, which lets you automate a browser and render the JavaScript that loads the page content dynamically. Check out my answer to a similar question here for some insight!
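As a rough illustration of the Selenium route, here is a minimal sketch. It assumes Selenium 4 and a recent local Chrome install (the `--headless=new` flag needs a newer Chrome), and the function names `scrape` and `collect_texts` are hypothetical. I haven't run it against this particular site, so treat the selector and timeout as starting points.

```python
def collect_texts(elements):
    """Return the stripped, non-empty text of each element-like object."""
    return [el.text.strip() for el in elements if el.text and el.text.strip()]


def scrape(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the text of matching elements."""
    # Imported here so collect_texts can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumption: recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return collect_texts(elements)
    finally:
        # Always shut the browser down, even if the wait times out
        driver.quit()
```

Calling `scrape("https://hk.appledaily.com/search/apple", ".flex-feature")` should then return a list of strings, one per rendered element. The explicit wait is the important part: it gives the page's JavaScript time to insert the elements before you read them.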
Upvotes: 1