NoobCoder

Reputation: 675

Not able to scrape content using BeautifulSoup

I am trying to scrape this website. My code for scraping it is:

import re

import requests
from bs4 import BeautifulSoup

ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;' \
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get("website--link", headers=headers)
webContent = response.content


root_tag=["div", {"class": "qtxgkq-0"}]
image_tag=["img",{"":""},"src"]

bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])

output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
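    if match is not None:
        image_url = match.group(0)
    else:
        # Fall back to an <img> tag; guard against divs that have none.
        img = div.find(image_tag[0], image_tag[1])
        image_url = img.get(image_tag[2]) if img is not None else None
    if image_url is not None:
        # Root-relative path ("/x" but not "//x"): prepend the site root.
        if image_url.startswith('/') and not image_url.startswith('//'):
            image_url = main_url + image_url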
    print(image_url)
    output.append(image_url)

I am getting an empty list, although I am selecting the correct tag. I also tried changing the root tag to

root_tag=["div", {"class": "b01o18-0 kpPYYo"}]

but I still get an empty list.

Upvotes: 0

Views: 161

Answers (1)

The Pjot

Reputation: 1859

Your code is fine, but you missed one important part. They render that part of the site via JavaScript, which requests does not execute ;) You only get the initial HTML, so the divs you are selecting are not in the response. The data is there, though, just not where you expect it to be: it is embedded as JSON in a script tag with the id __NEXT_DATA__.

import json
data = json.loads(bs.findAll('script', {'id': '__NEXT_DATA__'})[0].text)
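
The one-liner above raises an IndexError if the script tag is missing. A slightly more defensive version might look like the sketch below. The extract_next_data helper and the sample HTML are made up for illustration (the real page's JSON will have more fields), and it uses the built-in html.parser so it runs without lxml installed.

```python
import json

from bs4 import BeautifulSoup


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)


# Hypothetical stand-in for response.content; not the real site's markup.
sample_html = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"articles": [
  {"slug": "some-story", "image": {"url": "//cdn.example.com/a.jpg"}}
]}}}
</script>
</body></html>
"""

data = extract_next_data(sample_html)
articles = data["props"]["pageProps"]["articles"]
print(articles[0]["slug"], articles[0]["image"]["url"])
```

Returning None instead of raising makes it easy to tell "the page has no embedded JSON" apart from "the JSON has a different shape than expected".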

And go from there.

for article in data['props']['pageProps']['articles']:
    image_url = article['image']['url']
    if not image_url.startswith('http'):
        image_url = 'https:' + image_url
    print(image_url)
    # They use slug to build their news url, it's relative.
    slug = article['slug']
    # full url to news article
    news_url = f'{main_url}/{slug}'
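
As an alternative to prefixing 'https:' by hand, the standard library's urllib.parse.urljoin resolves protocol-relative, root-relative and plain relative URLs against a base in one call. This is only a sketch: the base URL and the sample paths are placeholders, not the real site's values, and it assumes the base ends with a slash so that a bare slug is appended rather than replacing the last path segment.

```python
from urllib.parse import urljoin

# Placeholder base; substitute the real main_url (with a trailing slash).
base = "https://www.example.com/news/"

# Protocol-relative URL: inherits the scheme from the base.
cdn_image = urljoin(base, "//cdn.example.com/a.jpg")

# Root-relative path: resolved against the host of the base.
local_image = urljoin(base, "/img/b.png")

# Already absolute: returned unchanged.
absolute_image = urljoin(base, "https://other.example.org/c.gif")

# Bare slug: appended to the base path.
news_url = urljoin(base, "some-story")

for url in (cdn_image, local_image, absolute_image, news_url):
    print(url)
```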

Upvotes: 1
