Reputation: 107
I'm experimenting with BeautifulSoup while working on my web scraper. My links variable returns the blocks of markup I specify, but as soon as I try to grab the "href" from each one, it only prints "None".
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.kickstarter.com/discover/advanced?sort=most_funded")
pageGrab = BeautifulSoup(r.content, "html.parser")
#This comment below is another way I tried
#for link in pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"}):
links = pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"})
for link in links:
    print(link.get("href"))
If I run the same script against another site, reddit for example, a few links are grabbed but the vast majority come back as "None".
This is the first element on the page whose "href" I am trying to extract:
<a target="" href="/projects/getpebble/pebble-time-awesome-smartwatch-no-compromises?ref=most_funded">Pebble Time - Awesome Smartwatch, No Compromises</a>
Upvotes: 4
Views: 4223
Reputation: 241238
You are selecting the div elements, which clearly don't have href attributes.
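To make the cause concrete, here is a minimal sketch that runs against a small hand-written HTML string rather than the live page. The markup is an assumption modelled on the snippet in the question, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the element shown in the question.
html = """
<div class="project-profile-title text-truncate-xs">
  <a target="" href="/projects/getpebble/pebble-time-awesome-smartwatch-no-compromises?ref=most_funded">Pebble Time</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "project-profile-title text-truncate-xs"})

# The div itself carries no href attribute, so .get() falls back to None.
div_href = div.get("href")

# The nested a element is the one that holds the href.
anchor_href = div.a.get("href")

print(div_href)
print(anchor_href)
```

Tag.get() behaves like dict.get(): a missing attribute yields None instead of raising an error, which is why the original loop printed "None" for every div.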
You could simplify your code and use the .select() method to target the a elements inside those div elements directly:
links = pageGrab.select('.project-profile-title.text-truncate-xs a')
for link in links:
    print(link.get('href'))
Of course, you could also keep your existing code and chain the .find() method after the div elements. However, that assumes every div element contains an a element: if one does not, .find('a') returns None and calling .get() on it raises an AttributeError. The .select() version above is therefore safer to use.
divs = pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"})
for div in divs:
    print(div.find('a').get("href"))
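If you prefer to stay with find_all() and .find(), one possible way to avoid the failure on a div without a link is to check for None before reading the attribute. This is only a sketch against made-up markup (the two div elements and the /projects/a path are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div deliberately contains no a element.
html = """
<div class="project-profile-title text-truncate-xs"><a href="/projects/a">A</a></div>
<div class="project-profile-title text-truncate-xs">No link here</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for div in soup.find_all("div", {"class": "project-profile-title text-truncate-xs"}):
    anchor = div.find("a")
    if anchor is None:
        # Skip divs without a nested a element instead of raising AttributeError.
        continue
    hrefs.append(anchor.get("href"))

print(hrefs)
```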
Additionally, if you want to take it a step further, the .select() method supports most CSS selectors, which means you could add the [href] attribute selector to match only the anchor elements that actually have an href attribute:
links = pageGrab.select('.project-profile-title.text-truncate-xs a[href]')
for link in links:
    print(link.get('href'))
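The difference between the two selectors can be seen on a small sample. The markup below is hypothetical and exists only to include one anchor without an href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = """
<div class="project-profile-title text-truncate-xs"><a href="/projects/a">A</a></div>
<div class="project-profile-title text-truncate-xs"><a name="no-href">B</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches every a element inside the target divs, with or without href.
all_anchors = soup.select(".project-profile-title.text-truncate-xs a")

# Matches only the a elements that carry an href attribute.
with_href = soup.select(".project-profile-title.text-truncate-xs a[href]")

all_hrefs = [a.get("href") for a in all_anchors]
filtered_hrefs = [a["href"] for a in with_href]

print(all_hrefs)
print(filtered_hrefs)
```

With the [href] filter in place, indexing with a["href"] is safe because every matched element is guaranteed to have the attribute.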
Upvotes: 2
Reputation: 12178
links = pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"})
for link in links:
    # The div does not have an href; link.a returns the first a tag inside the div.
    print(link.a.get("href"))
Output:
/projects/getpebble/pebble-time-awesome-smartwatch-no-compromises?ref=most_funded
/projects/ryangrepper/coolest-cooler-21st-century-cooler-thats-actually?ref=most_funded
/projects/getpebble/pebble-2-time-2-and-core-an-entirely-new-3g-ultra?ref=most_funded
/projects/poots/kingdom-death-monster-15?ref=most_funded
/projects/getpebble/pebble-e-paper-watch-for-iphone-and-android?ref=most_funded
/projects/597538543/the-worlds-best-travel-jacket-with-15-features-bau?ref=most_funded
/projects/elanlee/exploding-kittens?ref=most_funded
/projects/ouya/ouya-a-new-kind-of-video-game-console?ref=most_funded
/projects/peak-design/the-everyday-backpack-tote-and-sling?ref=most_funded
/projects/antsylabs/fidget-cube-a-vinyl-desk-toy?ref=most_funded
Upvotes: 1