Newbie

Reputation: 57

Index out of range error during Python web scraping (Beautiful Soup)

I am using a Python program to scrape a particular page. This is the code I am using:

#Area

    try:
        area= soup.find('div', 'location')
        result= str(area.get_text().strip().encode("utf-8"))
        # print([area_result])
        area_result=cleanup(result).split('>')[2].split(";")[0]
        nearby_result=cleanup(result).split('>')[2].split(";")[1]
        # nearby_result=cleanup(area_result).split('>')
        print "Area : ",area_result
        print "Nearby: ",nearby_result

        # print "Nearby : ",nearby_result

    except StandardError as e:
        area_result="Error was {0}".format(e)
        print area_result

import string

def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Remove special characters defined above.
        # Then we remove anything that is not printable (for instance \xe2)
        # Finally we remove duplicates within the string matching certain characters.
        if c in remove: continue
        elif c not in string.printable: continue
        elif len(newString) > 0 and c == newString[-1] and c in ('\n', ' ', ',', '.'): continue
        newString += c
    return newString

The website I am trying to scrape is this. The location info is in the right sidebar, e.g. UAE > Dubai > Jumeirah Village > Jumeirah Village Circle ; 3.2 km from Dubai Autodrome

The error I am getting is this:

- Error was Index out of range 
 

Can anyone tell me, based on my code, how I can fix this error?

Note that not all similar pages give this error.

Update: I tried mu's solution and now get this error:

Error was 'list' object has no attribute 'split'

Upvotes: 1

Views: 150

Answers (1)

Anshul Goyal

Reputation: 76827

The problem is in these two lines. Both take the third element (index [2]) of the '>' split whether or not it exists, and the second line also takes index [1] of the ';' split, which fails when that piece contains no ';':

area_result=cleanup(result).split('>')[2].split(";")[0]
nearby_result=cleanup(result).split('>')[2].split(";")[1]

Instead, check the length of each list before indexing into it, something like this:

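# Default to None so both names exist even when the text has too few parts.
area_result = nearby_result = None
cleanedup = cleanup(result).split('>')
if len(cleanedup) >= 3:
    results = cleanedup[2].split(";")
    if len(results) >= 2:
        area_result, nearby_result = results[0], results[1]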
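
To see why only some pages fail, here is a minimal sketch that wraps the same length checks in a helper. The name parse_location and the choice to return None for missing pieces are my own assumptions for illustration, not part of your code. It also strips whitespace itself instead of calling cleanup(), so that it runs on its own.

```python
def parse_location(text):
    """Return (area, nearby) from text like 'A > B > C ; note'.

    Hypothetical helper: any piece that is missing comes back as None
    instead of raising IndexError.
    """
    parts = [p.strip() for p in text.split('>')]
    if len(parts) < 3:
        # Fewer than three '>'-separated segments: index [2] does not exist.
        return None, None
    fields = [f.strip() for f in parts[2].split(';')]
    area = fields[0] or None
    # Only read the second field if the third segment contained a ';'.
    nearby = fields[1] if len(fields) > 1 and fields[1] else None
    return area, nearby


sample = ("UAE > Dubai > Jumeirah Village > "
          "Jumeirah Village Circle ; 3.2 km from Dubai Autodrome")
area_result, nearby_result = parse_location(sample)
```

With the sample line from the question, the third segment is "Jumeirah Village", which contains no ';', so the original .split(";")[1] has nothing to return there. The helper gives ("Jumeirah Village", None) for that input. As for the error in your update, 'list' object has no attribute 'split' generally means .split was called on the list that an earlier split returned rather than on one of its string elements.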

Upvotes: 1
