Reputation: 57
I am using a Python program to scrape a particular page. This is the code I am using:
#Area
try:
    area= soup.find('div', 'location')
    result= str(area.get_text().strip().encode("utf-8"))
    # print([area_result])
    area_result=cleanup(result).split('>')[2].split(";")[0]
    nearby_result=cleanup(result).split('>')[2].split(";")[1]
    # nearby_result=cleanup(area_result).split('>')
    print "Area : ",area_result
    print "Nearby: ",nearby_result
    # print "Nearby : ",nearby_result
except StandardError as e:
    area_result="Error was {0}".format(e)
    print area_result
def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Remove special characters defined above.
        # Then we remove anything that is not printable (for instance \xe2)
        # Finally we remove duplicates within the string matching certain characters.
        if c in remove: continue
        elif not c in string.printable: continue
        elif len(newString) > 0 and c == newString[-1] and c in ('\n', ' ', ',', '.'): continue
        newString += c
    return newString
The website I am trying to scrape is this. The location info is in the right sidebar, e.g. UAE > Dubai > Jumeirah Village > Jumeirah Village Circle ; 3.2 km from Dubai Autodrome
The error I am getting is this:
- Error was Index out of range
Can anyone tell me how I could fix this error, given the code above?
Note that not all similar pages give this error.
Update: I tried mu's solution and am now getting this error:
Error was 'list' object has no attribute 'split'
Upvotes: 1
Views: 150
Reputation: 76827
The problem is in these two lines, where you take the third element (index [2]) regardless of whether it exists:
area_result=cleanup(result).split('>')[2].split(";")[0]
nearby_result=cleanup(result).split('>')[2].split(";")[1]
Instead, check the length of each list before indexing into it, something like this:
cleanedup = cleanup(result).split('>')
if len(cleanedup) >= 3:
    results = cleanedup[2].split(";")
    if len(results) >= 2:
        area_result, nearby_result = results[0], results[1]
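The same length checks can be wrapped in a small helper that always returns a value. This is only a sketch: parse_location is a hypothetical name, and it assumes the text has the shape "A > B > C ; note". It also splits on ";" first, because in the example from the question the ";" sits in the fourth ">" segment, so splitting segment [2] on ";" yields only one element there.

```python
def parse_location(text):
    """Return (area, nearby) from text shaped like 'A > B > C ; note'.

    Hypothetical helper: missing pieces come back as None
    instead of raising IndexError.
    """
    # Split off the optional "; nearby" note first, wherever it appears.
    breadcrumb, _, note = text.partition(';')
    nearby = note.strip() or None
    # Break the breadcrumb into its levels, dropping empty pieces.
    levels = [p.strip() for p in breadcrumb.split('>') if p.strip()]
    # Keep the original choice of the third level, but only if it exists.
    area = levels[2] if len(levels) >= 3 else None
    return area, nearby


example = ("UAE > Dubai > Jumeirah Village > Jumeirah Village Circle "
           "; 3.2 km from Dubai Autodrome")
area_result, nearby_result = parse_location(example)
```

With this shape the caller can test each result for None rather than wrapping the whole block in try/except, and pages with fewer than three levels no longer raise.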
Upvotes: 1