Reputation: 675
I am trying to scrape this website. My code for scraping it is:
import re

import requests
from bs4 import BeautifulSoup

# main_url is assumed to hold the site's base URL (defined elsewhere)
ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;' \
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get("website--link", headers=headers)
webContent = response.content
root_tag=["div", {"class": "qtxgkq-0"}]
image_tag=["img",{"":""},"src"]
bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])
output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    # First try to find an absolute image URL anywhere in the div's markup
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    if match!=None:
        image_url = match.group(0)
    else:
        image_url = div.find(image_tag[0],image_tag[1]).get(image_tag[2])
    if image_url!=None:
        # Turn a site-relative path ("/x.jpg", but not "//host/x.jpg") into a full URL
        if image_url[0] == '/' and image_url[1] != '/':
            image_url = main_url + image_url
    print(image_url)
    output.append(image_url)
I am getting an empty list, although I am picking the correct tag. I also tried changing the root tag to
root_tag=["div", {"class": "b01o18-0 kpPYYo"}]
but I am still getting an empty list.
Upvotes: 0
Views: 161
Reputation: 1859
Your code is fine, but you missed one important part. They render that part of the site via JavaScript, which your request won't execute ;) You just get the initial HTML, so the div you are looking for is not in the response. But the data is there, just not where you expect it to be. It's embedded as JSON in a script tag.
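import json
# .string gives the raw script contents; .text can come back empty for <script> tags on some BeautifulSoup versions
data = json.loads(bs.find('script', {'id': '__NEXT_DATA__'}).string)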
And go from there.
for article in data['props']['pageProps']['articles']:
    image_url = article['image']['url']
    if not image_url.startswith('http'):
        image_url = 'https:' + image_url
    print(image_url)
    # They use slug to build their news url, it's relative.
    slug = article['slug']
    # full url to news article
    news_url = f'{main_url}/{slug}'
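To try the idea without hitting the site, here is a minimal, dependency-free sketch. It swaps BeautifulSoup for the standard library's `html.parser` and uses `urljoin` to normalise both protocol-relative image URLs and relative slugs. The sample HTML, `MAIN_URL`, `NextDataParser`, and `extract_articles` are all made up for illustration; only the `__NEXT_DATA__` id and the `props` / `pageProps` / `articles` path come from the snippet above, and the real page may be shaped differently.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the raw HTML that requests would return.
SAMPLE_HTML = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"articles": [
  {"slug": "first-story", "image": {"url": "//cdn.example.com/a.jpg"}},
  {"slug": "second-story", "image": {"url": "https://cdn.example.com/b.png"}}
]}}}
</script>
</body></html>
"""

MAIN_URL = "https://example.com"  # placeholder, not the real site


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_articles(html, base_url):
    """Return (image_url, news_url) pairs found in the embedded JSON."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    data = json.loads("".join(parser.chunks))
    results = []
    for article in data["props"]["pageProps"]["articles"]:
        # urljoin resolves "//host/path" against the base URL's scheme
        image_url = urljoin(base_url, article["image"]["url"])
        # the slug is relative, so join it onto the site root
        news_url = urljoin(base_url + "/", article["slug"])
        results.append((image_url, news_url))
    return results


for image_url, news_url in extract_articles(SAMPLE_HTML, MAIN_URL):
    print(image_url, news_url)
```

If the script tag is missing, `json.loads` raises on the empty string, which is a handy signal that the page layout changed.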
Upvotes: 1