Reputation: 447
I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna.
from urllib import request
from bs4 import BeautifulSoup
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist
url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
dialettando_tokens = []
while url:
html = request.urlopen(url).read().decode('utf8')
page = BeautifulSoup(html, 'html.parser')
a_list = page.find_all('a')
for a in a_list:
try:
a_str = str(a.contents[0])
if a_str[:3] == '<b>' and a.contents[0].string:
dialettando_tokens.append(a.contents[0].string.strip())
except:
pass
if a.string == 'Simonelli Editore Srl':
break
elif a.string == 'PROSSIMI':
link = a['href']
url = 'http://www.dialettando.com/dizionario/' + link
break
else:
url = ''
In the end of each iteration I need to parse url to the next page. HTML:
<a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna" class="titolinoverdone">PROSSIMI</a>
And I need to get this link:
'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna'
BUT the parser returns:
'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialettoRione=Sardegna'
This link doesn't work correctly and I can't understand what's wrong.
Upvotes: 1
Views: 76
Reputation: 2217
An href needs to have the ampersand character escaped, see this question. It is possible the site you visited is not escaping the ampersand inside the href correctly, and hoping they never accidentally reference an HTML entity, except in your case they did. It seems like you have to parse buggy HTML, plus a parser that didn't notice the semicolon was missing and did the HTML entity conversion anyway.
Upvotes: 1