GiveItAwayNow

Reputation: 447

Parser returns wrong url

I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna.

from urllib import request  

from bs4 import BeautifulSoup
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist

url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
dialettando_tokens = []

while url:
    html = request.urlopen(url).read().decode('utf8')
    page = BeautifulSoup(html, 'html.parser')
    a_list = page.find_all('a')
    for a in a_list:
        try:
            a_str = str(a.contents[0])
            if a_str[:3] == '<b>' and a.contents[0].string:
                dialettando_tokens.append(a.contents[0].string.strip())
        except:
            pass

        if a.string == 'Simonelli Editore Srl':
            break
        elif a.string == 'PROSSIMI':
            link = a['href']
            url = 'http://www.dialettando.com/dizionario/' + link
            break
        else:
            url = ''

At the end of each iteration I need to extract the URL of the next page. The relevant HTML:

<a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna" class="titolinoverdone">PROSSIMI</a>

And I need to get this link:

'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna' 

But the parser returns:

'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna'

This link doesn't work correctly and I can't understand what's wrong.

Upvotes: 1

Views: 76

Answers (1)

maxpolk

Reputation: 2217

An ampersand inside an href needs to be escaped as &amp;amp; (see this question). The site you visited is probably not escaping the ampersands inside its hrefs, and hoping that none of them ever happens to start an HTML entity name, except that in your case one did: &reg is the entity for the registered-trademark sign (®), so &regione gets read as that sign followed by "ione". In short, you have to parse buggy HTML, plus a parser that didn't notice the semicolon was missing and did the HTML entity conversion anyway.
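
To illustrate, here is a minimal sketch using only the standard library. It first reproduces the conversion with html.unescape, then shows one possible workaround: escaping every bare ampersand before the markup is decoded. The helper name escape_bare_ampersands and the regular expression are my own assumptions for illustration, not something the site or BeautifulSoup provides.

```python
import html
import re

# The href exactly as it appears in the page source (ampersands not escaped).
raw_href = 'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'

# html.unescape accepts a few legacy entity names without the trailing
# semicolon, "&reg" among them, so "&regione" is decoded as the
# registered-trademark sign followed by "ione".
decoded = html.unescape(raw_href)

# Hypothetical helper: matches an "&" that is NOT the start of a complete,
# semicolon-terminated entity such as "&amp;" or "&#x27;".
BARE_AMP = re.compile(r'&(?!#?\w+;)')


def escape_bare_ampersands(markup):
    """Turn bare ampersands into "&amp;" so later decoding leaves them intact."""
    return BARE_AMP.sub('&amp;', markup)


# After the repair, decoding gives back the href that was intended.
repaired = html.unescape(escape_bare_ampersands(raw_href))

print(decoded)
print(repaired)
```

In the loop from the question, the same helper could be applied to html before it is passed to BeautifulSoup(html, 'html.parser'). I have not run it against the live site, so treat it as a starting point rather than a verified fix.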

Upvotes: 1
