Reputation: 133
I am trying to crawl data from a list of URLs (the outer loop). For each URL, I loop over product_reviews['reviews'] (a list, the inner loop) and want to add more data to each review. Here is my code:
import requests
import pandas as pd

df = pd.read_excel(r'C:\ids.xlsx')
ids = df['ids'].values.tolist()
link = 'https://www.real.de/product/%s/'
url_test = 'https://www.real.de/pdp-test/api/v1/%s/product-attributes/?offset=0&limit=500'
url_test1 = 'https://www.real.de/pdp-test/api/v1/%s/product-reviews/?offset=0&limit=500'

for i in ids:
    product_id = requests.get(url_test %i).json()
    product_reviews = requests.get(url_test1 %i).json()
    for x in range(0,len(product_reviews['reviews']),1):
        product_reviews['reviews'][x]['variantAttributes'].append(str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][1]['label'].replace(" m","").replace(",",".")))))
        product_reviews['reviews'][x]['variantAttributes'].append(str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][0]['label'].replace(" m","").replace(",",".")))))
        product_reviews['reviews'][x]['size']= str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][1]['label'].replace(" m","").replace(",","."))))+ 'x' + str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][0]['label'].replace(" m","").replace(",","."))))
        product_reviews['reviews'][x]['url'] = link %i
        product_reviews['reviews'][x]['ean'] = product_id['defaultAttributes'][0]['values'][0]['text']
        product_reviews['reviews'][x]['TotalReviewperParent'] = product_reviews['totalReviews']
    df = pd.DataFrame(product_reviews['reviews'])
    df.to_excel( r'C:\new\str(i).xlsx', index=False)
However, when I run this code, it raises an error:
line 24, in <module>
    product_reviews['reviews'][x]['variantAttributes'].append(str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][1]['label'].replace(" m","").replace(",",".")))))
IndexError: list index out of range
The inner loop runs fine when I run it for a single URL, but when I put it inside the outer loop, it raises this error. What is the solution? My code also seems clumsy; do you know how I can make it shorter?
Upvotes: 1
Views: 97
Reputation: 22837
Please, in the future, try to create a Minimal, Reproducible Example. We don't have access to your 'ids.xlsx' so we can't verify if the problem is with a specific id in your list or a general problem.
Taking a random id, 338661983, and using the following code:
import requests

link = 'https://www.real.de/product/%s/'
url_attributes = 'https://www.real.de/pdp-test/api/v1/%s/product-attributes/?offset=0&limit=500'
url_reviews = 'https://www.real.de/pdp-test/api/v1/%s/product-reviews/?offset=0&limit=500'
ids = [338661983]

for i in ids:
    product_id = requests.get(url_attributes % i).json()
    product_reviews = requests.get(url_reviews % i).json()
    for review in product_reviews['reviews']:
        print(review)
        break
I get the following output:
{'reviewId': 1119427, 'title': 'Klasse!', 'date': '11.11.2020', 'rating': 5, 'isVerifiedPurchase': True, 'text': 'Originale Switch, schnelle Lieferung. Alles Top ', 'variantAttributes': [], 'author': 'hm-1511917085', 'datePublished': '2020-11-11T20:09:41+01:00'}
Notice that variantAttributes is an empty list.
You get an IndexError because you try to take the element at position 1 of that empty list in:
review['variantAttributes'][1]['label'].replace(" m","").replace(",",".")
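One way to avoid the error, and to shorten the loop at the same time, is to check the length of variantAttributes before indexing it and to move the repeated label conversion into a small function. The following is only a sketch under a few assumptions: the helper names label_to_cm and add_size are made up for illustration, every label is assumed to look like "1,60 m", reviews with fewer than two attributes get a size of None, and round() is used instead of int() so that floating-point noise does not turn 115 into 114. It also skips appending the converted values back onto variantAttributes.

```python
def label_to_cm(label):
    """Convert a label such as '1,60 m' to whole centimetres, as a string ('160')."""
    metres = float(label.replace(" m", "").replace(",", "."))
    # round() rather than int(): 100 * 1.15 is 114.99999999999999 in floating point
    return str(round(metres * 100))


def add_size(review):
    """Set review['size'] in place; None when there are fewer than two attributes."""
    attrs = review.get("variantAttributes") or []
    if len(attrs) < 2:
        # Assumption: a review without two variant attributes simply has no size
        review["size"] = None
        return review
    # Same order as in the question: attribute 1 first, then attribute 0
    review["size"] = label_to_cm(attrs[1]["label"]) + "x" + label_to_cm(attrs[0]["label"])
    return review


# Hand-written sample data shaped like the API output above (no network needed)
reviews = [
    {"reviewId": 1, "variantAttributes": []},
    {"reviewId": 2, "variantAttributes": [{"label": "0,80 m"}, {"label": "1,60 m"}]},
]
for review in reviews:
    add_size(review)
    print(review["reviewId"], review["size"])
```

Inside your outer loop you would call add_size(review) for each review in product_reviews['reviews'] and then set the url, ean and TotalReviewperParent keys as before. Separately, note that r'C:\new\str(i).xlsx' is a literal string, so every iteration writes to the same file; something like r'C:\new\%s.xlsx' % i is probably what you intended.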
Upvotes: 1