Reputation: 75
I wrote the following code to scrape info from a bookshop's website and save the data to a JSON file. The code works fine; however, I want to use a conditional statement to keep only the books with a specific word in the title (in this case 'Gardening'). No matter how I try to implement this, it returns all the books and not just the ones I specified.
bookArray = []
content = BeautifulSoup(open("mywebpage...",encoding="utf8"), "html.parser")
books = content.findAll('div', attrs={"class": "su-post"})
# book is a bs4.element.Tag object
for book in books:
    titles = book.find('h2', attrs={"class": "su-post-title"})
    titleText = titles.find('a').contents[0]
    if 'Gardening' or 'gardening' in titleText:
        dateAdded = book.find('div', attrs={"class": "su-post-meta"}).text
        urls = titles.find('a').attrs['href'].split()
        year = getPublishDate(url)
        bookObject = {
            "title": titleText,
            "url": urls,
            "year": year,
            "dateAdded": dateAdded.strip('\n\t').replace('Posted: ','')}
        bookArray.append(bookObject)
try:
    with open('bookList.json', 'w') as outfile:
        json.dump(bookArray, outfile)
except:
    print("Write to file failed")
I also tried the following approach, but I get the same output, with all books written to the JSON file:
for book in books:
    if 'Gardening' or 'gardening' in book.text():
        #have also tried if 'Gardening' in book.string:
        dateAdded = book.find('div', attrs={"class": "su-post-meta"}).text
        ...same as above
Finally, here is some sample output from the JSON file, showing that the conditional statements have no effect:
[{
"title": "Of Mice and Men",
"url": ["http://mysite...."],
"year": "1937",
"dateAdded": "2020-08-11"
},
{
"title": "Wuthering Heights",
"url": ["http://mysite...."],
"year": "1847",
"dateAdded": "2020-06-06"
},
Further details: If I modify the code to print out every book, they are displayed in the following HTML format:
for book in books:
print(book)
<div class="su-post" id="su-post-6238">
<h2 class="su-post-title"><a href="ref to local file.../">Wuthering Heights</a></h2>
<div class="su-post-meta">Posted: 2020-06-06</div>
<div class="su-post-excerpt"></div>
</div>
<div class="su-post" id="su-post-8990">
<h2 class="su-post-title"><a href="ref to another local file...">Of Mice and Men</a></h2>
<div class="su-post-meta">Posted: 2020-08-11</div>
<div class="su-post-excerpt"></div>
</div>
Upvotes: 1
Views: 312
Reputation: 25048
The problem is this line:
if 'Gardening' or 'gardening' in titleText:
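Python parses this as 'Gardening' or ('gardening' in titleText). Because the non-empty string 'Gardening' is truthy, the condition always evaluates as true, and the membership test on the right is never reached.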
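To see why the original condition always passes, it may help to evaluate the pieces separately. This is only a minimal sketch; `title_text` is a stand-in variable, with a sample title borrowed from the question's output.

```python
title_text = "Wuthering Heights"

# `in` binds tighter than `or`, so this is parsed as
#   'Gardening' or ('gardening' in title_text)
buggy = 'Gardening' or 'gardening' in title_text

# `or` returns its first truthy operand. A non-empty string is truthy,
# so the membership test on the right is never evaluated.
print(repr(buggy))   # 'Gardening'
print(bool(buggy))   # True

# Testing each word against the title gives the intended result.
fixed = 'Gardening' in title_text or 'gardening' in title_text
print(fixed)         # False
```

So the `if` branch runs for every book, whatever its title.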
Solution: change the condition to
if "Gardening" in titleText or "gardening" in titleText:
or
if "gardening" in titleText.lower():
or
if any(x in titleText for x in ['Gardening', 'gardening']):
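As a rough illustration of the case-insensitive variant, the filter can be tried on plain strings before wiring it into the scraping loop. The helper name `filter_titles` and the two gardening titles below are made up for this sketch; only the other two titles come from the question.

```python
def filter_titles(titles, keyword):
    """Return the titles that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [title for title in titles if needle in title.lower()]


# Hypothetical sample data: the gardening titles are invented for the example.
sample = [
    "Of Mice and Men",
    "Gardening for Beginners",
    "Wuthering Heights",
    "Organic gardening",
]

matches = filter_titles(sample, "gardening")
print(matches)  # ['Gardening for Beginners', 'Organic gardening']
```

Lowercasing both sides also catches spellings such as 'GARDENING', which the two-word checks above would miss.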
Upvotes: 1