Parser returns wrong url

Question

I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna.

from urllib import request  

from bs4 import BeautifulSoup
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist

url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
dialettando_tokens = []

while url:
    html = request.urlopen(url).read().decode('utf8')
    page = BeautifulSoup(html, 'html.parser')
    a_list = page.find_all('a')
    for a in a_list:
        try:
            a_str = str(a.contents[0])
            if a_str[:3] == '' and a.contents[0].string:
                dialettando_tokens.append(a.contents[0].string.strip())
        except:
            pass

        if a.string == 'Simonelli Editore Srl':
            break
        elif a.string == 'PROSSIMI':
            link = a['href']
            url = 'http://www.dialettando.com/dizionario/' + link
            break
        else:
            url = ''

At the end of each iteration I need to extract the URL of the next page. HTML:

<a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna">PROSSIMI</a>

And I need to get this link:

'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'

BUT the parser returns:

'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna'

This link doesn't work correctly and I can't understand what's wrong.

maxpolk · Accepted Answer

An ampersand inside an href must be escaped as &amp;, see this question. The site you are scraping apparently does not escape the ampersands in its hrefs and relies on the query string never happening to spell an HTML entity name. In your case it does: &regione begins with &reg, the entity for the registered sign (®). So you have to parse buggy HTML, with a parser that did not notice the semicolon was missing and did the HTML entity conversion anyway.
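To make the mechanism concrete, here is a minimal sketch that reproduces the conversion with the standard library's html.unescape instead of the full parser, and then undoes it. The helper name repair_href is hypothetical, and the repair assumes that a literal ® never legitimately appears in this site's URLs.

```python
import html

# The href as it appears in the page source, with the ampersands left unescaped.
raw_href = 'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'

# Entity decoding accepts the legacy name "reg" without a trailing semicolon,
# so "&regione" becomes the registered sign followed by "ione".
decoded_href = html.unescape(raw_href)
print(decoded_href)


def repair_href(href):
    """Hypothetical helper: undo the accidental '&reg' -> registered-sign conversion.

    Assumes the registered sign never legitimately occurs in these URLs.
    """
    return href.replace('\u00ae', '&reg')


base = 'http://www.dialettando.com/dizionario/'
next_url = base + repair_href(decoded_href)
print(next_url)
```

In the loop from the question, the same idea would amount to passing a['href'] through repair_href before building the next url.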
