Reputation: 61
I don't want the same email address to appear twice. With this code I get the error TypeError: unhashable type: 'list'. So I assume that the line allLinks = set() is wrong and I have to use a tuple and not a list, is that right?
Here's my code:
import requests
from bs4 import BeautifulSoup as soup

def get_emails(_links:list):
    for i in range(len(_links)):
        new_d = soup(requests.get(_links[i]).text, 'html.parser').find_all('a', {'class':'my_modal_open'})
        if new_d:
            yield new_d[-1]['title']

start = 20
while True:
    d = soup(requests.get('http://www.schulliste.eu/type/gymnasien/?bundesland=&start={page_id}'.format(page_id=start)).text, 'html.parser')
    results = [i['href'] for i in d.find_all('a')][52:-9]
    results = [link for link in results if link.startswith('http://')]
    next_page=d.find('div', {'class': 'paging'}, 'weiter')
    if next_page:
        start+=20
    else:
        break
    allLinks= set()
    if results not in allLinks:
        print(list(get_emails(results)))
        allLinks.add(results)
Upvotes: 2
Views: 634
Reputation: 61
I got it working but I still get duplicate emails.
allLinks = []
if results not in allLinks:
    print(list(get_emails(results)))
    allLinks.append((results))
Does anybody know why?
Upvotes: 0
Reputation: 222852
You're trying to add an entire list of emails as a single entry in the set. What you want is to add each email as a separate set entry.
The problem is in this line:
allLinks.add(results)
It adds the entire results list as a single element in the set, and that doesn't work because a list is mutable and therefore not hashable. Use this instead:
allLinks.update(results)
This will update the set with the elements from the list, so each element becomes a separate entry in the set.
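To illustrate the difference, here is a minimal sketch with made-up data. The example URLs and the names page_one, page_two, seen and new_links are assumptions for illustration, not part of the question's code. It shows add failing on a list, update adding each element separately, and a membership check that keeps only the links not seen on an earlier page:

```python
# Hypothetical link lists standing in for the `results` of two pages.
page_one = ["http://a.example", "http://b.example"]
page_two = ["http://b.example", "http://c.example"]

seen = set()

# add() tries to hash the whole list, which fails because lists are mutable.
try:
    seen.add(page_one)
    add_failed = False
except TypeError:
    add_failed = True

# update() iterates over the list and adds each string as its own element.
seen.update(page_one)

# Only links that were not seen before need to be fetched for the second page.
new_links = [link for link in page_two if link not in seen]
seen.update(new_links)
```

For this to work across pages, the set probably has to be created once before the loop rather than on every iteration. Note also that this only deduplicates links; if two different links lead to the same email address, the emails themselves would need to go through a set as well.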
Upvotes: 1