Reputation: 117
I want to scrape a specific part of the website Kickstarter.com.
I need the text of each project title. The page is structured so that every project has this line:
<div class="Project-title">
My code looks like:
# Load libraries
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print (title)
If I use soup.find_all instead of soup.find, or use an index other than zero in project_title[0], Python raises an error.
I need a list of all the project titles on this page. E.g.:
Upvotes: 4
Views: 1148
Reputation: 572
Going by the title of this post, I would recommend two tutorials on scraping particular data from a website. Both explain in detail how the task is done.
First, I would recommend checking out the pyimagesearch tutorial "Scraping images using scrapy".
Second, if you need something more specific, a dedicated web scraping tutorial will help you.
Upvotes: 1
Reputation: 629
find() only returns one element. To get all of them, you must use findAll (spelled find_all in current BeautifulSoup versions).
Here's the code you need:
project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)
We look at all the elements with tag h6 and class project-title. We then take the title from each of these elements and build a list from them.
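To make the difference concrete, here is a minimal sketch that runs against a small inline HTML sample instead of the live page. The markup in SAMPLE_HTML is an assumption made up for illustration; the real Kickstarter page may be structured differently. It shows that find() gives back a single tag, while find_all() gives back a list-like result whose items have to be processed one by one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real listing page.
SAMPLE_HTML = """
<h6 class="project-title"><a href="/p/1">First project</a></h6>
<h6 class="project-title"><a href="/p/2">Second project</a></h6>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns a single Tag (or None), so only the first title is reachable.
first = soup.find("h6", {"class": "project-title"})
first_title = first.find_all("a")[0].text

# find_all() returns a list-like ResultSet. Tag methods must be called on
# its items, not on the ResultSet itself, otherwise Python raises an error.
headings = soup.find_all("h6", {"class": "project-title"})
try:
    headings.find_all("a")
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Looping over the ResultSet yields one title per heading.
all_titles = [heading.find_all("a")[0].text for heading in headings]

print(first_title)
print(error_name)
print(all_titles)
```

This is probably why swapping find for find_all in the question's code fails: the chained call that follows is applied to the whole result set rather than to each heading.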
I hope this helps; don't hesitate to ask if you have any questions.
Edit: the problem with the code above is that it raises an IndexError if any element in the list returned by findAll has no child with tag a.
How to prevent this:
project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]
This adds a title to the list only if project.findChildren('a') has at least one element (an empty list [] evaluates to False).
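A slightly shorter way to write the same guard, sketched here on made-up markup (SAMPLE_HTML is an assumption, not the real page), is to use the attribute shortcut project.a, which is equivalent to project.find('a') and gives None when the heading has no anchor.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading deliberately has no <a> child.
SAMPLE_HTML = """
<h6 class="project-title"><a href="/p/1">With link</a></h6>
<h6 class="project-title">No link here</h6>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
project_elements = soup.find_all("h6", {"class": "project-title"})

# project.a is None when there is no anchor, so such headings are skipped
# instead of raising an error.
project_titles = [
    project.a.text
    for project in project_elements
    if project.a is not None
]

print(project_titles)
```

Whether to skip such headings silently or to log them depends on how much you trust the page structure; skipping is just the simplest choice for a sketch.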
Edit: to get the descriptions of the elements (class project-blurb), let's look at the HTML code:
<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>
This is just a paragraph with class project-blurb. To get all of them, we could use the same approach as for project_elements, or a more condensed form:
project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
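One detail worth noting: because the paragraph text sits on its own line in the HTML, .text keeps the surrounding newlines. The sketch below, which uses a shortened inline sample as an assumption rather than the live page, compares .text with get_text(strip=True), which trims that whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the snippet above, with shortened text.
SAMPLE_HTML = """
<p class="project-blurb">
Bagel is a digital tape measure.
</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
blurbs = soup.find_all("p", {"class": "project-blurb"})

# .text keeps the newlines that surround the sentence in the markup.
raw_desc = [blurb.text for blurb in blurbs]

# get_text(strip=True) removes leading and trailing whitespace.
clean_desc = [blurb.get_text(strip=True) for blurb in blurbs]

print(repr(raw_desc[0]))
print(clean_desc)
```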
Upvotes: 2
Reputation: 180401
All the data you want is in the section with the CSS class staff-picks; just find the h6 elements with the project-title class and extract the text from the anchor tag inside each one:
soup = BeautifulSoup(thepage,"html.parser")
print([a.text for a in soup.select("section.staff-picks h6.project-title a")])
Output:
[u'The Superbook: Turn your smartphone into a laptop for $99', u'Weighitz: Weigh Smarter', u'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', u"Bagel: The World's Smartest Tape Measure", u'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', u'ISOLATE\xae - Switch off your ears!']
Or using find with find_all:
project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])
There is also only one anchor tag inside each h6 tag, so you cannot end up with more than one title per project, whichever approach you take.
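As a quick check that the two approaches agree, here is a small sketch on inline markup. The HTML, including the second section with class other, is invented for illustration and only assumes that titles outside staff-picks should be ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first section should contribute titles.
SAMPLE_HTML = """
<section class="staff-picks">
  <h6 class="project-title"><a href="/p/1">Picked one</a></h6>
  <h6 class="project-title"><a href="/p/2">Picked two</a></h6>
</section>
<section class="other">
  <h6 class="project-title"><a href="/p/3">Not picked</a></h6>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selector: anchors inside h6.project-title inside section.staff-picks.
via_select = [
    a.text for a in soup.select("section.staff-picks h6.project-title a")
]

# Same result with find() for the section and find_all() for the headings.
section = soup.find("section", class_="staff-picks")
via_find = [h6.a.text for h6 in section.find_all("h6", "project-title")]

print(via_select)
print(via_find)
```

The selector form is more compact, while the find/find_all form makes it easier to handle a missing section explicitly, since find() returns None in that case.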
Upvotes: 0