I'm trying to scrape a list of websites listed in the text file 'tastyrecipes'. I currently have a for loop which returns the urls, but I can't figure out how to put the urls into requests.get() without getting a 404 error. The websites return a 200 status code when requested individually, and there are no problems viewing the HTML.
I have tried string formatting, where I did

with open('tastyrecipes', 'r') as f:
    for i in f:
        source = requests.get("{0}".format(i))

however this didn't change the result.
import requests
from bs4 import BeautifulSoup

with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        source = requests.get(i)
        content = source.content
        soup = BeautifulSoup(content, 'lxml')
        list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
        method = list_object.text
        new_file.write(method)
    new_file.close()
I expected iterating over i to allow scraping each url in the text file, however it returns a 404 error.
Upvotes: 1
Views: 1898
Reputation: 787
First, check whether the url is valid or not:
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit
import requests
from bs4 import BeautifulSoup

def is_valid_url(url=''):
    url_parts = urlsplit(url)
    return url_parts.scheme and url_parts.netloc and url_parts.path

with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        if is_valid_url(i):
            source = requests.get(i)
            content = source.content
            soup = BeautifulSoup(content, 'lxml')
            list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
            method = list_object.text
            new_file.write(method)
    new_file.close()
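As a standalone sanity check, the validator can be exercised on its own. This is a minimal sketch assuming Python 3, where urlsplit lives in urllib.parse; the sample inputs are just illustrative:

```python
from urllib.parse import urlsplit

def is_valid_url(url=''):
    parts = urlsplit(url)
    # A usable recipe URL needs a scheme, a host, and a path.
    return bool(parts.scheme and parts.netloc and parts.path)

print(is_valid_url('https://tasty.co/recipe/brigadeiros'))  # True
print(is_valid_url('not a url'))                            # False
```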
Upvotes: 0
Reputation: 3537
It was impossible for me to find a problem with requests.get per se.
import requests

recipes = ['https://tasty.co/recipe/deep-fried-ice-cream-dogs',
           'https://tasty.co/recipe/fried-shrimp-and-mango-salsa-hand-rolls',
           'https://tasty.co/recipe/brigadeiros']

print(list(map(requests.get, recipes)))
# [<Response [200]>, <Response [200]>, <Response [200]>]

for recipe in recipes:
    print(requests.get(recipe))
# <Response [200]>
# <Response [200]>
# <Response [200]>
Checking for invalid urls is a legitimate answer if there are incorrect urls in the tastyrecipes file, as @jwodder suggested.
Upvotes: 0
Reputation: 57610
The lines i in the file f are returned with trailing newlines, which do not belong in normal URLs. You need to remove the newlines with i = i.rstrip('\r\n') before passing i to requests.get().
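A minimal, network-free sketch of that fix (the io.StringIO stand-in for the open file and the sample URLs are illustrative):

```python
import io

# Stand-in for open('tastyrecipes'): iterating a file yields each line
# with its trailing newline still attached.
fake_file = io.StringIO(
    "https://tasty.co/recipe/brigadeiros\n"
    "https://tasty.co/recipe/deep-fried-ice-cream-dogs\n"
)

# Strip the newline (and skip blank lines) before a URL reaches requests.get.
urls = [line.rstrip('\r\n') for line in fake_file if line.strip()]
print(urls)
# ['https://tasty.co/recipe/brigadeiros',
#  'https://tasty.co/recipe/deep-fried-ice-cream-dogs']
```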
Upvotes: 1