Reputation: 1545
In the code below, the first `print tag` loop gives me a large list of several hundred `<a>` tags. The second `print tag` loop gives me a list with only four `<a>` tags, not including the ones that I want.
One of the tags that I want is at the end of `tags`. After printing all several hundred tags, I print the last tag, and that prints the correct end tag as it should. But when I run another `for` loop over the same (unchanged) list `tags`, I get not just a different result, but a significantly different one.
The phenomenon happens with or without the `print '\n\n\n'` line; it is only there to make the split between the two prints easier for me to see.
What is happening to this list between the first and second `for` loop to cause this problem?
(This code is exactly as I have it in my script. Originally I didn't have the lines from the first `for` loop until the empty line; I added them to debug why the correct URL is missing from the end result.)
EDIT: Also, here is what is being printed by all the `print` statements (only the last section of the first `print` within the `for` loop); it is shown after the code:
import urllib
from bs4 import BeautifulSoup

startingList = ['http://www.stowefamilylaw.co.uk/']

for url in startingList:
    try:
        html = urllib.urlopen(url)
        soup = BeautifulSoup(html,'lxml')
        tags = soup('a')
        for tag in tags:
            print tag
        print tags[-1]
        print '\n\n\n'

        for tag in tags:
            print tag
            if not tag.get('href', None).startswith('..'):
                continue
    except:
        continue
....
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/faq-category/decrees-orders-forms/" itemprop="url">Decrees, Orders & Forms</a>
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/faq-category/international-divorce/" itemprop="url">International Divorce</a>
<a class="shiftnav-target"><i class="fa fa-chevron-left"></i> Back</a>
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/contact/" itemprop="url"><i class="fa fa-phone"></i> Contact</a>
<a class="shiftnav-target" href="http://www.stowefamilylaw.co.uk/contact/" itemprop="url"><i class="fa fa-phone"></i> Contact</a>
<a href="http://www.stowefamilylaw.co.uk/">Stowe Family Law</a>
<a href="#spu-5086" style="color: #fff"><div class="callbackbutton"><i class="fa fa-phone" style="font-size: 16px"></i> Request Callback </div></a>
<a href="#spu-5084" style="color: #fff"><div class="callbackbutton"><i class="fa fa-envelope-o" style="font-size: 16px"></i> Quick Enquiry </div></a>
<a class="ubermenu-responsive-toggle ubermenu-responsive-toggle-main ubermenu-skin-black-white-2 ubermenu-loc-primary" data-ubermenu-target="ubermenu-main-3-primary"><i class="fa fa-bars"></i>Main Menu</a>
Upvotes: 0
Views: 338
Reputation: 1122082
You have a blanket `except:`:
    try:
        # ...
    except:
        continue
so any error in the block is masked and the rest of the loop body is skipped. Don't use blanket `except` handlers without re-raising, ever; see Why is "except: pass" a bad programming practice?. At the very least, catch only `Exception` and print the error:
    except Exception as e:
        print 'Encountered:', e
Without proper diagnostics all we can do is guess.
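As a rough illustration of how the blanket handler hides the failure, here is a minimal sketch. It uses plain dicts as stand-ins for the tags (an assumption made so the example runs without a network connection; `dict.get` is called the same way as `tag.get` in the question), and the function names `scan_silently` and `scan_with_diagnostics` are hypothetical. It is written to run on both Python 2 and Python 3.

```python
# Plain dicts stand in for BeautifulSoup tags (assumption for illustration).
# The second one has no 'href', like the "Back" link in the printed output.
fake_tags = [{'href': '../a'}, {}, {'href': '../b'}]

def scan_silently(tags):
    """Blanket except: the error is swallowed and the loop simply ends early."""
    seen = []
    try:
        for tag in tags:
            seen.append(tag)
            if not tag.get('href', None).startswith('..'):
                continue
    except:  # deliberately bad, to reproduce the masking
        pass
    return seen

def scan_with_diagnostics(tags):
    """Catch Exception and keep the error so it can be reported."""
    seen, error = [], None
    try:
        for tag in tags:
            seen.append(tag)
            if not tag.get('href', None).startswith('..'):
                continue
    except Exception as e:
        error = e
    return seen, error

silent = scan_silently(fake_tags)
seen, error = scan_with_diagnostics(fake_tags)
print(len(silent))            # the third tag is never reached
print(type(error).__name__)   # the exception type that was being hidden
```

Both functions stop at the tag without an `href`, but only the second one leaves you with something to inspect.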
One error you definitely have is an attribute error here when there is no `href` attribute; the `None` object doesn't have an attribute `startswith`:
if not tag.get('href', None).startswith('..'):
Instead of `None`, return an empty string:
if not tag.get('href', '').startswith('..'):
or better yet, select only `a` tags with an `href` attribute:
tags = soup.select('a[href]')
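To compare the two fixes side by side, here is a small sketch on an inline HTML snippet. The markup is hypothetical, and `'html.parser'` is used instead of `'lxml'` only so the example does not depend on lxml being installed; it should run on both Python 2 and Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a relative link, an anchor without href, an absolute link.
html = '''
<a href="../about/">About</a>
<a class="shiftnav-target">Back</a>
<a href="http://www.stowefamilylaw.co.uk/contact/">Contact</a>
'''

# 'html.parser' is an assumption made to avoid the lxml dependency.
soup = BeautifulSoup(html, 'html.parser')

all_anchors = soup('a')              # includes the anchor without href
with_href = soup.select('a[href]')   # only anchors that carry an href

# With an empty-string default, .startswith() is always called on a string.
relative = [tag['href'] for tag in all_anchors
            if tag.get('href', '').startswith('..')]

print(len(all_anchors))
print(len(with_href))
print(relative)
```

Either approach avoids calling `.startswith()` on `None`; the selector version also keeps anchors without an `href` out of the list entirely.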
Upvotes: 3