muntazir nawani

Reputation: 23

Cannot get headlines content while scraping

I am new to scraping and have tried every method I could find, but I am not getting the desired results. I want to scrape all the headlines from https://www.accesswire.com/newsroom/. The headlines show up when I inspect the page in the browser, but when I scrape it with bs4 or Selenium I get neither the full page source nor the headlines.

I have tried time.sleep(10), but that did not help. I also used Selenium to load the page, and that did not work either. The headlines reside in the div with the classes column-15 w-col w-col-9.

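import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent': ua.chrome}
url = "https://www.accesswire.com/newsroom/"
response = requests.get(url, headers=header)
time.sleep(12)  # tried waiting here; the response body is already complete
soup = BeautifulSoup(response.content, 'html.parser')
time.sleep(12)
headline_div = soup.find("div", {"class": "column-15 w-col w-col-9"})
print(headline_div)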

I just want to get all the headlines and their links from this page, or at least the full page source so that I can extract them myself.

Upvotes: 1

Views: 159

Answers (2)

QHarr

Reputation: 84465

You don't need Selenium. Use requests, which is more efficient, and call the API endpoint that the page itself uses to load the headlines:

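import re
import requests
from bs4 import BeautifulSoup as bs  # the 'lxml' parser below requires the lxml package

# The endpoint returns JavaScript that injects the headline HTML into the page
r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
# Capture the HTML string passed to $('#newslist').after(...)
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
# Collect (headline text, link) pairs
headlines = [(item.text, item['href']) for item in soup.select('a.headlinelink')]
print(headlines)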

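Regex explanation:

The pattern matches the literal text  $('#newslist').after(' (with a leading space) and then captures everything that follows on that line up to the last ); which is the HTML string that the script inserts after the #newslist element.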
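To see what the pattern captures without calling the site, here is a minimal offline sketch. The SAMPLE_RESPONSE string is a hypothetical stand-in for the script text; the real response is only assumed to have this general shape. The HeadlineParser class and extract_headlines function are illustrative names, and the standard-library html.parser is swapped in for BeautifulSoup so that the sketch has no third-party dependencies.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the script the endpoint returns (assumed shape).
SAMPLE_RESPONSE = (
    "$(function () {\n"
    " $('#newslist').after('<div>"
    "<a class=\"headlinelink\" href=\"/a/1\">First headline</a>"
    "<a class=\"headlinelink\" href=\"/a/2\">Second headline</a>"
    "</div>');\n"
    "});\n"
)

# Same pattern as in the answer above.
PATTERN = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")


class HeadlineParser(HTMLParser):
    """Collects (text, href) pairs for <a class="headlinelink"> elements."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_link = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "headlinelink" in classes:
            self._in_link = True
            self._href = attr_map.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a headline link.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.headlines.append(("".join(self._text).strip(), self._href))
            self._in_link = False


def extract_headlines(script_text):
    """Return (text, href) pairs found in the injected HTML, or [] if no match."""
    match = PATTERN.search(script_text)
    if match is None:
        return []
    parser = HeadlineParser()
    parser.feed(match.group(1))
    parser.close()
    return parser.headlines


if __name__ == "__main__":
    print(extract_headlines(SAMPLE_RESPONSE))
```

Note that the captured group ends with the closing quote of the JavaScript string. That stray character is harmless here because it falls outside any link and is ignored by the parser.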

Upvotes: 2

Dalvenjia

Reputation: 2033

If fetching and parsing the page does not work, it is because the content is generated dynamically by JavaScript. You will need Selenium so that a real browser renders the content for you:

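from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('https://www.accesswire.com/newsroom/')
# Wait up to 15 seconds for the JavaScript-rendered links to appear
headline_links = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'a.headlinelink')))
headlines = [link.get_attribute('textContent') for link in headline_links]
driver.quit()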
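As a rough illustration of why a plain fetch finds nothing, the sketch below counts headline links in two hypothetical page sources: one as an HTTP client might receive it, where the links exist only inside a script string, and one as it might look after a browser has run the script. Both HTML samples and the names LinkCounter and count_headline_links are made up for this example. It relies only on the fact that an HTML parser treats the contents of a script element as text, not as markup.

```python
from html.parser import HTMLParser

# Hypothetical source as fetched over HTTP: the links are only script text.
STATIC_HTML = """
<div id="newslist"></div>
<script>
 $('#newslist').after('<a class="headlinelink" href="/a/1">First headline</a>');
</script>
"""

# The same hypothetical page after a browser has executed the script.
RENDERED_HTML = """
<div id="newslist"></div>
<a class="headlinelink" href="/a/1">First headline</a>
"""


class LinkCounter(HTMLParser):
    """Counts real <a class="headlinelink"> elements in the markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "headlinelink" in classes:
            self.count += 1


def count_headline_links(html):
    counter = LinkCounter()
    counter.feed(html)
    counter.close()
    return counter.count


if __name__ == "__main__":
    print("static:", count_headline_links(STATIC_HTML))
    print("rendered:", count_headline_links(RENDERED_HTML))
```

If the count for the fetched source is zero while the browser shows the links, the content is being injected by JavaScript, and waiting with time.sleep after the request cannot change what was downloaded.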

Upvotes: 0
