user11035198

Why is requests.get() not working in a for loop?

I'm trying to scrape a list of websites listed in the text file 'tastyrecipes'. I currently have a for loop that reads the URLs, but I can't figure out how to pass them to requests.get() without getting a 404 error. Each URL returns a 200 status code when requested individually, and there are no problems viewing the HTML.

I tried string formatting:

with open('tastyrecipes', 'r') as f:
    for i in f:
        source = requests.get("{0}".format(i)) 

However, this didn't change the result. This is the full loop:

with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        source = requests.get(i)
        content = source.content
        soup = BeautifulSoup(content, 'lxml')
        list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
        method = list_object.text
        new_file.write(method)
        new_file.close()

I expected i to let me scrape each URL in the text file in turn, but the requests return a 404 error.

Upvotes: 1

Views: 1898

Answers (3)

paras chauhan

Reputation: 787

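First, check whether the URL is valid:

from urllib.parse import urlsplit  # Python 3; on Python 2 use: from urlparse import urlsplit

def is_valid_url(url=''):
    # A usable URL needs a scheme, a host, and a path
    url_parts = urlsplit(url)
    return url_parts.scheme and url_parts.netloc and url_parts.path

Then request only the URLs that pass the check:

import requests
from bs4 import BeautifulSoup

with open('tastyrecipes', 'r') as f:
    new_file = open("recipecorpus.txt", "a+")
    for i in f:
        if is_valid_url(i):
            source = requests.get(i)
            content = source.content
            soup = BeautifulSoup(content, 'lxml')
            list_object = soup.find('ol', class_='prep-steps list-unstyled xs-text-3')
            method = list_object.text
            new_file.write(method)
    # Close the output file once, after the loop has finished
    new_file.close()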

Upvotes: 0

Alex Yu

Reputation: 3537

Analysis

I could not reproduce any problem with requests.get itself.

import requests
recipes = ['https://tasty.co/recipe/deep-fried-ice-cream-dogs',
           'https://tasty.co/recipe/fried-shrimp-and-mango-salsa-hand-rolls',
           'https://tasty.co/recipe/brigadeiros']
print(list(map(requests.get, recipes)))
[<Response [200]>, <Response [200]>, <Response [200]>]

for recipe in recipes: print(requests.get(recipe))
<Response [200]>
<Response [200]>
<Response [200]>

Possible problems

1. A 404 is not a problem in itself

It is a legitimate response if the URLs are incorrect.

2. Trailing \n and whitespace in the tastyrecipes file

This was suggested by @jwodder.
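
One way to check for this is to print each line with repr(). The sketch below uses io.StringIO as a stand-in for the real file (an assumption for illustration, as is the choice of sample URLs); a file opened with open() iterates the same way, yielding each line with its newline still attached.

```python
import io

# Stand-in for open('tastyrecipes'); a real text file iterates the same way.
fake_file = io.StringIO(
    "https://tasty.co/recipe/brigadeiros\n"
    "https://tasty.co/recipe/deep-fried-ice-cream-dogs\n"
)

lines = list(fake_file)
for line in lines:
    # repr() makes the otherwise invisible trailing '\n' visible
    print(repr(line))
```

If the printed strings end in '\n', the URL passed to requests.get() is not the same string as the one that works when typed by hand.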

Upvotes: 0

jwodder

Reputation: 57610

The lines i in the file f are returned with trailing newlines, which do not belong in normal URLs. You need to remove the newlines with i = i.rstrip('\r\n') before passing i to requests.get().
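
A minimal sketch of that cleanup step, kept separate from the network call so it can be run on its own. The helper name clean_urls is hypothetical, and it uses strip() rather than rstrip('\r\n'), so it also drops surrounding spaces and skips blank lines, which goes slightly beyond what is described above.

```python
def clean_urls(lines):
    """Strip line endings and surrounding whitespace; skip blank lines."""
    urls = []
    for line in lines:
        url = line.strip()
        if url:
            urls.append(url)
    return urls


# Sample input shaped like lines read from a text file
raw = [
    "https://tasty.co/recipe/brigadeiros\n",
    "\n",
    "https://tasty.co/recipe/deep-fried-ice-cream-dogs\r\n",
]
urls = clean_urls(raw)
print(urls)
```

In the loop from the question, the file object f can be passed to clean_urls() directly, and each returned URL then goes to requests.get().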

Upvotes: 1
