paulf

BeautifulSoup4 using conditional statements with tag object

I wrote the following code to scrape book information from a bookshop website and save it to a JSON file. The code works, but I want to use a conditional statement to keep only the books whose title contains a specific word (in this case 'Gardening'). However I implement the condition, every book is returned instead of only the ones I specified.

import json
from bs4 import BeautifulSoup

bookArray = []
content = BeautifulSoup(open("mywebpage...", encoding="utf8"), "html.parser")
books = content.findAll('div', attrs={"class": "su-post"})
# each book is a bs4.element.Tag object

for book in books:
    titles = book.find('h2', attrs={"class": "su-post-title"})
    titleText= titles.find('a').contents[0]
    if 'Gardening' or 'gardening' in titleText:
        dateAdded = book.find('div', attrs={"class": "su-post-meta"}).text
        urls = titles.find('a').attrs['href'].split()
        year = getPublishDate(url)
        bookObject = {
            "title": titleText,
            "url": urls,
            "year": year,
            "dateAdded": dateAdded.strip('\n\t').replace('Posted: ', '')
        }
        bookArray.append(bookObject)

try:
    with open('bookList.json', 'w') as outfile:
        json.dump(bookArray, outfile)
except:
    print("Write to file failed")

I also tried the following method, but I get the same output, with all books written to the JSON file:

for book in books:
    if 'Gardening' or 'gardening' in book.text():
        # have also tried: if 'Gardening' in book.string:

        dateAdded = book.find('div', attrs={"class": "su-post-meta"}).text
        # ...same as above

Finally, here is some sample output from the JSON file, showing that the conditional statements have no effect:

[{
    "title": "Of Mice and Men",
    "url": ["http://mysite...."],
    "year": "1937",
    "dateAdded": "2020-08-11"
},
{
    "title": "Wuthering Heights",
    "url": ["http://mysite...."],
    "year": "1847",
    "dateAdded": "2020-06-06"
},

Further details: If I modify the code to print out every book, they are displayed in the following HTML format:

for book in books:
    print(book)

<div class="su-post" id="su-post-6238">
    <h2 class="su-post-title"><a href="ref to local file.../">Wuthering Heights</a></h2>
    <div class="su-post-meta">Posted: 2020-06-06</div>
    <div class="su-post-excerpt"></div>
</div>
<div class="su-post" id="su-post-8990">
    <h2 class="su-post-title"><a href="ref to another local file...">Of Mice and Men</a></h2>
    <div class="su-post-meta">Posted: 2020-08-11</div>
    <div class="su-post-excerpt"></div>
</div>


Answers (1)

HedgeHog

The problem is this line:

if 'Gardening' or 'gardening' in titleText:

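Python parses this as 'Gardening' or ('gardening' in titleText). Because the non-empty string 'Gardening' is truthy, the condition always evaluates as true, whatever titleText contains.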
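
To make the grouping visible, here is a minimal sketch that evaluates the original condition against one of the titles from the question's sample output. The variable names (title_text, original, grouped) are mine and only for illustration.

```python
# A title from the question's sample output that does NOT mention gardening.
title_text = "Wuthering Heights"

# The condition exactly as written in the question.
original = 'Gardening' or 'gardening' in title_text

# The same condition with the grouping Python applies made explicit.
grouped = 'Gardening' or ('gardening' in title_text)

# 'or' returns its first truthy operand, so both expressions are the
# string 'Gardening' itself and the membership test is never consulted.
print(repr(original))              # 'Gardening'
print(bool(original))              # True
print('gardening' in title_text)   # False
```

Since the left operand alone decides the result, the if branch runs for every book.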

Solution: change the condition to

if "Gardening" in titleText or "gardening" in titleText:

or

if "gardening" in titleText.lower():

or

if any(x in titleText for x in ['Gardening', 'gardening']):
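
The three conditions are not quite interchangeable, so here is a small sketch comparing them on plain strings. The first two titles come from the question's sample output; the other three are hypothetical and only there to exercise the filters. The helper names are also mine, not part of BeautifulSoup.

```python
titles = [
    "Wuthering Heights",         # from the question's sample output
    "Of Mice and Men",           # from the question's sample output
    "Gardening for Beginners",   # hypothetical title
    "The gardening year",        # hypothetical title
    "Organic GARDENING",         # hypothetical title
]


def matches_explicit(title):
    # Option 1: test each spelling separately.
    return "Gardening" in title or "gardening" in title


def matches_lower(title):
    # Option 2: normalise the case first, then test once.
    return "gardening" in title.lower()


def matches_any(title):
    # Option 3: any() over a list of accepted spellings.
    return any(x in title for x in ["Gardening", "gardening"])


explicit = [t for t in titles if matches_explicit(t)]
lowered = [t for t in titles if matches_lower(t)]
with_any = [t for t in titles if matches_any(t)]

print(explicit)   # ['Gardening for Beginners', 'The gardening year']
print(lowered)    # ['Gardening for Beginners', 'The gardening year', 'Organic GARDENING']
print(with_any)   # ['Gardening for Beginners', 'The gardening year']
```

Only the lower() variant is fully case-insensitive; the other two match just the spellings listed, which may or may not be what you want.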
