Jawad Ahmad Khan

Reputation: 309

Python BS4 find() find_all() returns empty lists

Hi, I am trying to scrape the website https://www.dawn.com/pakistan, but BeautifulSoup's find() and find_all() methods return empty lists. I have tried the html.parser, html5lib and lxml parsers, still no luck. The classes I am trying to scrape are present in the page source as well as in the soup object, but things don't seem to be working. Any help will be appreciated, thanks!

Code:

from bs4 import BeautifulSoup
import lxml
import html5lib
import urllib.request

url1 = 'https://www.dawn.com/pakistan'

req = urllib.request.Request(
    url1,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
url1UrlContent = urllib.request.urlopen(req).read()
soup1 = BeautifulSoup(url1UrlContent, 'lxml')

url1Section1 = soup1.find_all('h2', class_='story__title-size-five-text-black- font--playfair-display')
print(url1Section1)

Upvotes: 5

Views: 9952

Answers (2)

QHarr

Reputation: 84465

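I don't think you can pass compound class names like that. These are compound class names, i.e. several separate classes on the same element, so I have used CSS selectors instead, which are also a faster retrieval method. In a CSS selector, the spaces between the individual classes are replaced with ".".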
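To illustrate the compound-class idea without hitting the network, here is a minimal sketch. The HTML snippet is made up for the example (only the class names are taken from the question), and html.parser is used just to avoid extra dependencies, so treat it as an assumption about how the real page is structured rather than a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: the class names come
# from the question, the headlines and links are invented for illustration.
html = """
<article>
  <h2 class="story__title size-five text-black font--playfair-display">
    <a href="/news/1">First headline</a>
  </h2>
  <h2 class="story__title size-three">
    <a href="/news/2">Second headline</a>
  </h2>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Every class gets its own "." prefix and nothing separates them, so the
# <h2> must carry all four classes to match.
compound = soup.select(
    "h2.story__title.size-five.text-black.font--playfair-display a"
)
titles = [a.get_text(strip=True) for a in compound]

# Selecting on one class alone matches every <h2> that carries it.
all_titles = [a.get_text(strip=True) for a in soup.select("h2.story__title a")]

print(titles)
print(all_titles)
```

With this sample markup the first selector should pick up only the first headline, while the single-class selector picks up both.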

If you are after the headers you can use a slightly different selector combination

import requests
from bs4 import BeautifulSoup

url = 'https://www.dawn.com/pakistan'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
items = [item.text.strip() for item in soup.select('h2[data-layout=story] a')]
print(items)

To limit to just those on the left you can use:

items = [item.text.strip() for item in soup.select('.story__title.size-five.text-black.font--playfair-display a')]

More broadly,

items = [item.text.strip() for item in soup.select('article [data-layout=story]')] 

As per your comment:

items = [item.text.strip() for item in soup.select('.col-sm-6.col-12')] 

Upvotes: 2

chitown88

Reputation: 28565

Yours should work as well (I used a different syntax), but the class string you have doesn't match the one on the page.

You have: 'story__title-size-five-text-black- font--playfair-display'

and I have: 'story__title size-five text-black font--playfair-display '. It's a very slight difference: the classes are separated by spaces, not hyphens.

Replace:

url1Section1=soup1.find_all('h2', class_='story__title-size-five-text-black- font--playfair-display')

with:

url1Section1=soup1.find_all('h2', {'class':'story__title size-five text-black font--playfair-display '})

and see if that helps.
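A quick way to check the string mismatch locally is to run find_all() against a small snippet. This is only a sketch: the HTML below is a made-up stand-in for the real page (class names copied from the question, no trailing space in the attribute), and html.parser is an arbitrary choice. It compares the hyphenated string, the space-separated string, a single class, and a reordered string.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document; the real page is assumed to use the
# same class attribute on its <h2> headlines.
html = (
    '<h2 class="story__title size-five text-black font--playfair-display">'
    '<a href="/news/1">First headline</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# The string from the question: hyphens where the page has spaces.
hyphenated = soup.find_all(
    "h2", class_="story__title-size-five-text-black- font--playfair-display"
)

# The exact, space-separated value of the class attribute.
exact = soup.find_all(
    "h2", class_="story__title size-five text-black font--playfair-display"
)

# Any single class from the attribute also matches.
single = soup.find_all("h2", class_="story__title")

# Same classes in a different order: the string comparison is order-sensitive.
reordered = soup.find_all(
    "h2", class_="size-five story__title text-black font--playfair-display"
)

print(len(hyphenated), len(exact), len(single), len(reordered))
```

On this snippet only the exact string and the single class find the element. If the order of the classes might vary, a CSS selector via select() or a single-class search is the less brittle option.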

Upvotes: 4
