Reputation: 1049
My whole application is a little sitemap scraper. I feed it the root link; from there it scans the site for more links, and then scrapes those pages for more links too, kind of like a sitemap generator, just more verbose. The bigger picture is that some sites contain links to YouTube, Facebook, Google, etc. These can lead on for an eternity and put my app into a limbo chain, so I decided to feed it a blocker so we can remove those bigger websites.
I have a file called blocked_sites.txt, in which I have:
facebook
youtube
And I have a set in which I have:
'facebook.com', 'youtube.com', 'gold'
So, what I want to do is:
Points 1 and 2 I got done, but the third one is a gotcha. This is what I tried:
# For every url in urls
for url in urls:
    # For every blocker inside blocked
    for blocker in blocked:
        # If URL contains BLOCKER
        if blocker in url:
            # Remove THAT URL
            urls.remove(url)
            print('removed: ' + url)
print(urls)
The problem is that I can't really modify a set while iterating through it at the same time. So what are my options?
Here's what I thought:
URL that DOESN'T contain a blocker and copy it to another set
-- This seems a bit bulky. I mean, we would then have to deal with urls, blocker and new_urls, and it doesn't seem like a good idea, especially if I am constantly feeding more and more links to the old list. It doesn't seem very memory efficient.
{ 'item' } as my set as opposed to [ 'item' ]?
Okay, so take these first sets:
urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook'}
>>> RuntimeError: Set changed size during iteration
Alrighty, let's do it this way:
urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook']
>>> Removed: facebook
Yay it worked!
What if we add more blockers like so:
urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook', 'youtube']
>>>Removed: facebook
['youtube.com', 'gold']
That's strange! For some reason, it can only take off one blocker?
How do I get to the gold?
Upvotes: 0
Views: 44
Reputation: 16623
Changing a list/set's content during iteration is typically a recipe for disaster. In almost all cases, it is better to construct a new list/set instead of operating in place. This is very simple with a comprehension:
urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook', 'youtube']
urls = [url for url in urls if not any(blocker in url for blocker in blocked)]
print(urls)
# ['gold']
With sets:
urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}
urls = {url for url in urls if not any(blocker in url for blocker in blocked)}
print(urls)
# {'gold'}
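However, do note that iterating over a set tends to be somewhat slower than iterating over a list, so the list version is probably a little faster. The set version still has the advantage of dropping duplicate URLs automatically.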
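As for why the original loop only removed one URL: here is a minimal sketch of what seems to be happening, using the same sample data. The names visited and safe_urls are made up for illustration. Removing an item shifts the remaining items one position to the left, so the loop's internal index steps over the item that moved into the freed slot.

```python
urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook', 'youtube']

# Record which items the loop actually looks at (illustrative helper list).
visited = []
for url in urls:
    visited.append(url)
    if any(blocker in url for blocker in blocked):
        # Everything after this item shifts left by one position,
        # so the next iteration skips 'youtube.com'.
        urls.remove(url)

print(visited)  # ['facebook.com', 'gold']
print(urls)     # ['youtube.com', 'gold']

# Iterating over a copy avoids the skip, at the cost of one extra list.
safe_urls = ['facebook.com', 'youtube.com', 'gold']
for url in list(safe_urls):
    if any(blocker in url for blocker in blocked):
        safe_urls.remove(url)

print(safe_urls)  # ['gold']
```

Iterating over a copy works, but the comprehension is shorter and avoids the repeated remove calls, each of which has to search the list.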
Upvotes: 1
Reputation: 14226
We can extend your approach a bit further to achieve what you want solely using set operations.
found = set()
urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}
for url in urls:
    for blocker in blocked:
        if blocker in url:
            found.add(url)
urls.difference(found)
{'gold'}
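Note that difference returns a new set and leaves urls itself untouched. If you would rather update the existing set in place, since the question was worried about juggling an extra copy, something along these lines might do. This is only a sketch: remove_blocked is a name I made up, and it assumes plain substring matching is what you want.

```python
def remove_blocked(urls, blocked):
    """Remove from `urls`, in place, every URL containing a blocker substring.

    Hypothetical helper for illustration; returns the set of removed URLs.
    """
    # Collect matches first, so the set is never modified while being iterated.
    found = {url for url in urls if any(blocker in url for blocker in blocked)}
    # difference_update mutates `urls` instead of building a new set.
    urls.difference_update(found)
    return found


urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}

removed = remove_blocked(urls, blocked)
print(urls)             # {'gold'}
print(sorted(removed))  # ['facebook.com', 'youtube.com']
```

The temporary found set only holds the URLs being dropped, so the extra memory stays small when most links survive the filter.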
Upvotes: 2