modestmotion

Reputation: 37

Comparing to a list in Python using BeautifulSoup

The following code is meant to look through tags in a webpage (the 'b', 'strong' and 'a' tags in specific 'li' entries). If a tag is found in a list (descList, defined in the code), the entry's 'a class=vote-description__evidence' tag is added to another list; otherwise '0' is added to that list. Here is the code:

import urllib2
from BeautifulSoup import *

def votedescget(link):
    response = urllib2.urlopen(link)
    html = response.read()
    soup = BeautifulSoup(html)
    desc = soup.findAll('ul',{'class':"vote-descriptions"})
    readVotes = open("categories.txt","r")
    #descList = []

    #for line in readVotes.read().splitlines():
        #descList.append(line)

    resultsList = []
    descList = ['<b>gay rights</b>', '<b>smoking bans</b>', '<b>hunting ban</b>', '<b>marriage</b>', '<b>equality and human rights</b>', '<b>assistance to end their life</b>', '<b>UK military forces</b>', '<b>Iraq war</b>', '<strong>investigations</strong>', '<b>Trident</b>', '<b>EU integration</b>', '<b>EU</b>', '<b>Military Covenant</b>', '<b>right to remain for EU nationals</b>', '<b>UK membership of the EU</b>', '<b>military action against <a href="https://en.wikipedia.org/wiki/Islamic_State_of_Iraq_and_the_Levant">ISIL (Daesh)</a></b>', '<b>housing benefit</b>', '<b>welfare benefits</b>', '<b>illness or disability</b>', '<b>council tax</b>', '<b>welfare benefits</b>', '<b>guaranteed jobs for young people</b>', '<b>income tax</b>', '<b>rate of VAT</b>', '<b>alcoholic drinks</b>', '<b>taxes on plane tickets</b>', '<b>fuel for motor vehicles</b>', '<b>income over &pound;150,000</b>', '<b>occupational pensions</b>', '<b>occupational pensions</b>', '<b>banker&rsquo;s bonus tax</b>', '<b>taxes on banks</b>', '<b>mansion tax</b>', '<b>rights for shares</b>', '<b>regulation of trade union activity</b>', '<b>capital gains tax</b>', '<b>corporation tax</b>', '<b>tax avoidance</b>', '<b>incentives for companies to invest</b>', '<b>high speed rail</b>', '<b>private patients</b>', '<b>NHS</b>', '<b>foundation hospitals</b>', '<b>smoking bans</b>', '<b>assistance to end their life</b>', '<b>autonomy for schools</b>', '<b>undergraduate tuition fee</b>', '<a href="https://en.wikipedia.org/wiki/Academy_(English_school)">academy schools</a>', '<b>financial support</b>', '<b>tuition fees</b>', '<b>funding of local government</b>', '<b>equal number of electors</b>', '<b>fewer MPs</b>', '<b>transparent Parliament</b>', '<a href="https://en.wikipedia.org/wiki/Proportional_representation">proportional system</a>', '<strong>wholly elected</strong>', '<b>taxes on business premises</b>', '<b>campaigning by third parties</b>', '<b>fixed periods between parliamentary elections</b>', 
'<b>hereditary peers</b>', '<b>more powers to the Welsh Assembly</b>', '<b>more powers to the Scottish Parliament</b>', '<b>powers for local councils</b>', '<b>over laws specifically impacting their part of the UK</b>', '<b>voting age</b>', '<b>stricter asylum system</b>', '<b>intervene in inquests</b>', '<b>ID cards</b>', '<b>Police and Crime Commissioners</b>', '<b>retention of information about communications</b>', '<b>enforcement of immigration rules</b>', '<b>mass surveillance</b>', '<b>merging police and fire services</b>', '<b>prevent climate change</b>', '<b>fuel for motor vehicles</b>', '<b>forests</b>', '<b>taxes on plane tickets</b>', '<b>electricity generation</b>', '<b>culling badgers</b>', '<b>hydraulic fracturing (fracking)</b>', '<b>high speed rail</b>', '<b>bus services</b>', '<b>rail fares</b>', '<b>fuel for motor vehicles</b>', '<b>taxes on plane tickets</b>', '<b>publicly owned railway system</b>', '<b>secure tenancies for life</b>', '<b>market rent to high earners renting a council home</b>', '<b>regulation of gambling</b>', '<b>civil service redundancy payments</b>', '<b title="Including voting to maintain them">anti-terrorism laws</b>', '<b>Royal Mail</b>', '<b>pub landlords rent-only leases</b>', '<b>legal aid</b>', '<b>courts in secret sessions</b>', '<b>register of lobbyists</b>', '<b>no-win no fee cases</b>', '<b>letting agents</b>', '<b><a href="http://webarchive.nationalarchives.gov.uk/20100527091800/http://programmeforgovernment.hmg.gov.uk/">Conservative - Liberal Democrat Coalition Agreement</a></b>']
    #print descList

    for line in desc:
        li_list = line.findAll('li')
        for li in li_list:
            if len(li.findAll('b')) == 1:
                if li.find('b') in descList:
                    resultsList.append(str(li.find('a',{'class':"vote-description__evidence"})))
                    print li.find('a',{'class':"vote-description__evidence"})
            elif len(li.findAll('b')) == 2:
                print li.findAll('b')[1]
                if li.findAll('b')[1] in descList:
                    resultsList.append(str(li.find('a',{'class':"vote-description__evidence"})))
                    print li.find('a',{'class':"vote-description__evidence"})
            elif li.find('strong') in descList:
                resultsList.append(str(li.find('a',{'class':"vote-description__evidence"})))
                print li.find('a',{'class':"vote-description__evidence"})
            elif li.find('a') in descList:
                resultsList.append(str(li.find('a',{'class':"vote-description__evidence"})))
                print li.find('a',{'class':"vote-description__evidence"})
            else:
                resultsList.append('0')

    print resultsList

votedescget("https://www.theyworkforyou.com/mp/10001/diane_abbott/hackney_north_and_stoke_newington/votes")

Usually the list is created programmatically from a file but for the sake of ease I've just included it as a variable. For some reason the result I'm getting when I run this code is as follows:

<b>assistance to end their life</b>
<b>council tax</b>
<b>assistance to end their life</b>
<b>over laws specifically impacting their part of the UK</b>
<b>electricity generation</b>
<b>no-win no fee cases</b>
<b>letting agents</b>
['0', '0', '0', '0']

Could anyone tell me why this is happening, or how to fix it? What I'm expecting is a list of zeroes interspersed with results where the tags are found in descList, but this isn't what's happening.

Upvotes: 0

Views: 525

Answers (1)

B.Adler

Reputation: 1539

In your comparison you check if li.find('b') in descList: Have you tested whether a Beautiful Soup object can be compared to a string in this way? find returns a Beautiful Soup object (a Tag, or None when nothing matches) rather than a plain string, which is why you cast it with str() before appending it to your list; however, you are not casting it before this comparison, so the membership test presumably never matches.
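
Here is a minimal sketch of that idea, written in Python 3 syntax so it runs without BeautifulSoup installed. FakeTag and evidence_or_zero are hypothetical names, not part of the question's code or of the library: FakeTag only imitates an object whose str() is its markup, and evidence_or_zero should work the same way on real tags. Separately, the final else in the question only runs when none of the if/elif conditions hold, so an li with one or two b tags that are not in descList appends nothing; that may explain why only four '0' entries appear. The helper below returns '0' for every non-match instead.

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed tag: str() gives its markup,
    but the object itself never compares equal to a plain string."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


def evidence_or_zero(candidates, evidence, desc_list):
    """Return str(evidence) if any candidate's markup is in desc_list, else '0'."""
    for tag in candidates:
        # Cast to str before the membership test; skip missing tags (None).
        if tag is not None and str(tag) in desc_list:
            return str(evidence)
    return '0'


desc_list = ['<b>council tax</b>', '<b>letting agents</b>']
tag = FakeTag('<b>council tax</b>')
link = FakeTag('<a class="vote-description__evidence">Show votes</a>')

print(tag in desc_list)       # False: object compared with strings
print(str(tag) in desc_list)  # True: string compared with strings

results = [
    evidence_or_zero([tag], link, desc_list),
    evidence_or_zero([FakeTag('<b>forests</b>')], link, desc_list),
]
print(results)
```

In the question's loop, the candidates for each li would be the tags it already inspects (the b, strong and a tags), and evidence would be the result of li.find('a',{'class':"vote-description__evidence"}).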

Upvotes: 1
