Reputation: 301
I'm trying to collect the email addresses found on a list of pages, and I've run into two problems. First, the email variable ends up holding the same address twice. Second, the if statement detects that email has a value, but that value never ends up in the emails list. Thank you for helping out!
import re
import time

import requests

emails = []
comment = []
with open('comment.txt', 'r') as filehandle:
    for line in filehandle:
        currentPlace = line[:-1]
        comment.append(currentPlace)
print(emails)
i = 0
while i < len(comment):
    url = str(comment[i]) + '/about'
    print("Crawling URL %s" % url)
    response = requests.get(url)
    email = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I)
    print(email)
    if email:
        emails.append(email)
        email.clear()
    i += 1
    time.sleep(0.2)
print(emails)
Output:
[]
Crawling URL ...
['[email protected]', '[email protected]']
Crawling URL ...
[]
Crawling URL ...
['[email protected]', '[email protected]']
Crawling URL ...
[]
Crawling URL ...
[]
[[], []]
My old code produced the correct output:
emails = set()
print("Crawling URL %s" % starting_url)
response = requests.get(starting_url)
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
print(emails)
# create a BeautifulSoup object for the HTML document
soup = BeautifulSoup(response.text, 'lxml')
Upvotes: 1
Views: 135
Reputation: 6190
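re.findall (https://docs.python.org/3/library/re.html#re.findall) returns a list of all non-overlapping matches of your regular expression. So on those pages the pattern is simply matching twice, most likely because the address appears twice in the page's HTML; nothing is being duplicated by the regex itself.
You then call emails.append(email), but email is itself a list of addresses. So emails becomes a list of lists, like [["[email protected]", "[email protected]"], ["[email protected]", "[email protected]"], ...], rather than a flat list of addresses. The email.clear() call that follows empties that same list object, which is why your final output is [[], []]. Use emails.extend(email) instead and drop the clear() call. Your old code avoided both problems because it built a new set for each page and merged it with emails.update(new_emails).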
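To make the point above concrete, here is a minimal sketch rather than a drop-in replacement. The helper name collect_emails, the sample_pages strings, and the example.com addresses are all made up for illustration, and the function takes page text directly instead of calling requests.get so that it runs without network access. It shows why append followed by clear leaves empty lists behind, and one way to build a flat, de-duplicated list.

```python
import re

# Same pattern as in the question, compiled once (case-insensitive).
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)


def collect_emails(pages):
    """Return the unique addresses found across pages, in first-seen order.

    `pages` is any iterable of page text (hypothetical helper, not from the
    original post).
    """
    emails = []
    seen = set()
    for text in pages:
        # findall returns a list; add its items one by one instead of
        # appending the list itself.
        for address in EMAIL_RE.findall(text):
            key = address.lower()
            if key not in seen:
                seen.add(key)
                emails.append(address)
    return emails


# Why the original loop ended with [[], []]: append stores a reference to
# the list, so clearing that list afterwards also empties the stored entry.
found = ["a@example.com"]
nested = []
nested.append(found)
found.clear()
print(nested)

# Made-up page text: the first page contains the same address twice
# (once in the mailto link, once as visible text).
sample_pages = [
    '<a href="mailto:info@example.com">info@example.com</a>',
    "no address here",
    "Contact: Sales@Example.org",
]
result = collect_emails(sample_pages)
print(result)
```

In the original loop, the equivalent change would be to pass response.text for each URL into the same logic, or simply to replace emails.append(email) with emails.extend(email) and remove email.clear().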
Upvotes: 2