pynewbee

Reputation: 679

My Beautiful Soup scraper is not working as intended

I am trying to pull the ingredients list from the following webpage:

https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/

So the first ingredient I want to pull would be Acetylated Lanolin, and the last ingredient would be Octyl Palmitate.

Looking at the page source for this URL, I see that each entry in the ingredients list follows this pattern:

<td valign="top" width="33%">Acetylated Lanolin <sup>5</sup></td>

So I wrote some code to pull the list, and it is giving me zero results. Below is the code.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find_all('td', attrs={'valign': 'top'})

When I check len(results), I get zero.

What am I doing wrong? Why am I not able to pull the list as intended? I am new to web scraping.

Upvotes: 1

Views: 100

Answers (2)

Morse

Reputation: 9125

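Your request is being rejected: the server responds with 403 Forbidden, so the soup contains only an error page and there is nothing to crawl. The website appears to be blocking scrapers. Printing the soup shows this: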

print(soup)

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr/><center>nginx</center>
</body>
</html>

Upvotes: -1

Julien Spronck

Reputation: 15423

Your parsing code is working as intended; it is the request that failed. If you check the status code of the response, you will see that you get a 403:

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/')
print(r.status_code) # 403
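
If you would rather have the script fail loudly than silently parse an error page, one option is to call raise_for_status() on the response. Below is a minimal sketch, not a definitive implementation; the helper name fetch_html and the 10-second timeout are my own choices and not part of the question.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Hypothetical helper: return the page body, or raise if the server refuses.

    The name `fetch_html` and the default timeout are illustrative assumptions.
    """
    r = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx/5xx responses such as 403,
    # so an error page never reaches the parser.
    r.raise_for_status()
    return r.text
```

With this, a 403 shows up as an exception at the request step instead of as an empty result list later on.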

The server appears to reject requests that do not look like they come from a browser. To make it work, send a User-Agent header with the request, similar to what a browser would send:

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/56.0.2924.76 Safari/537.36')
}

r = requests.get('https://skinsalvationsf.com/2012/08/updated-comedogenic-ingredients-list/', headers=headers)

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('td', attrs={'valign':'top'})
print(len(results))
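
Once the request succeeds, each result is still a whole td tag, including the number in the sup tag. Here is a rough sketch of how the names alone could be pulled out, assuming every cell follows the pattern shown in the question. The function name extract_ingredients and the sample string are illustrative only, and I have not checked this against the full live page.

```python
from bs4 import BeautifulSoup


def extract_ingredients(html):
    """Hypothetical helper: return the text of each <td valign="top"> cell,
    with any <sup> rating removed."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []
    for td in soup.find_all('td', attrs={'valign': 'top'}):
        # Remove the <sup> element so only the ingredient name is left.
        for sup in td.find_all('sup'):
            sup.decompose()
        name = td.get_text(strip=True)
        if name:  # skip empty cells
            names.append(name)
    return names


# Sample built from the single cell quoted in the question.
sample = ('<table><tr><td valign="top" width="33%">'
          'Acetylated Lanolin <sup>5</sup></td></tr></table>')
print(extract_ingredients(sample))
```

On the real page you would pass r.text instead of the sample string.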

Upvotes: 2
