EricTalv

Reputation: 1049

Remove items from one set that contain an item from another set

My application is a small sitemap scraper. I feed it a root link, it scans that site for more links, and then it scrapes those pages for further links, much like a sitemap generator but more verbose. The catch is that some sites link to YouTube, Facebook, Google, and so on. Following those links could go on forever and leave my app stuck in an endless chain, so I decided to give it a block list to filter out those bigger websites.

I have a file called blocked_sites.txt that contains:

facebook
youtube

And I have a set in which I have:

'facebook.com', 'youtube.com', 'gold'

So what I want to do is:

  1. Compare the items of both collections to one another
  2. Check whether a urls item CONTAINS a blocked_sites item
  3. Remove that item if it contains a BLOCKED item

I got points 1 and 2 done, but the third one is a gotcha. This is what I tried first:

# For every url in urls
for url in urls:
    # For every blocker inside blocked
    for blocker in blocked:
        # If URL contains BLOCKER
        if blocker in url:
            # Remove THAT URL
            urls.remove(url)
            print('removed: ' + url)
print(urls)

The problem is that I can't modify a set while iterating over it. So what are my options?

Here's what I thought:

  1. Take each URL that DOESN'T contain a blocker and copy it to another set. This seems a bit bulky: we would then have to deal with urls, blocked, and new_urls. It doesn't seem like a good idea, especially if I am constantly feeding more and more links into the old set, and it doesn't seem very memory efficient.
  2. Convert them into lists! Hey, it worked! But only for about 3 items? On further look, isn't a set already a list? Yet I got an error when I used { 'item' } as my set as opposed to [ 'item' ].

Okay so take these first sets:

urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook'}
>>> RuntimeError: Set changed size during iteration

Alrighty, let's do it this way:

urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook']
>>> Removed: facebook

Yay it worked!

What if we add more blockers like so:

urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook', 'youtube']
>>>Removed: facebook
   ['youtube.com', 'gold']

That's strange! For some reason, it can only take off one blocker?

How do I get to the gold?

Upvotes: 0

Views: 44

Answers (2)

iz_

Reputation: 16623

Changing a list/set's content during iteration is typically a recipe for disaster. In almost all cases, it is better to construct a new list/set instead of operating in place. This is very simple with a comprehension:

urls = ['facebook.com', 'youtube.com', 'gold']
blocked = ['facebook', 'youtube']

urls = [url for url in urls if not any(blocker in url for blocker in blocked)]
print(urls)
# ['gold']
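
Since the blockers in the question live in blocked_sites.txt, the same comprehension could be wrapped in a small helper. This is only a sketch under a few assumptions: the file holds one blocker per line, matching should be case-insensitive (not stated in the question), and the names load_blockers and filter_urls are made up for illustration. io.StringIO stands in for the real file so the snippet runs on its own.

```python
import io


def load_blockers(lines):
    """Build a set of blockers from an iterable of text lines.

    Blank lines are skipped and entries are lowercased (an assumption).
    """
    return {line.strip().lower() for line in lines if line.strip()}


def filter_urls(urls, blockers):
    """Return a new list holding only the URLs that contain no blocker."""
    return [
        url for url in urls
        if not any(blocker in url.lower() for blocker in blockers)
    ]


# Stand-in for: with open('blocked_sites.txt') as f: blockers = load_blockers(f)
fake_file = io.StringIO("facebook\nyoutube\n\n")
blockers = load_blockers(fake_file)

urls = ['facebook.com', 'youtube.com', 'gold']
kept = filter_urls(urls, blockers)
print(kept)
# ['gold']
```

An open file object can be passed to load_blockers directly, because iterating over a file yields its lines.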

With sets:

urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}

urls = {url for url in urls if not any(blocker in url for blocker in blocked)}
print(urls)
# {'gold'}

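However, do note that iterating over a set is generally somewhat slower than iterating over a list, so the list version may be a little faster. The set version, on the other hand, removes duplicate URLs for free.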
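
As for why the list version in the question only removed one item: removing an element shifts every later element one position to the left, while the loop's internal index still moves forward, so the element right after a removed one is never visited. The sketch below reproduces that, then shows one possible workaround, iterating over a snapshot copy. The variable names visited, skipped_result, and fixed are introduced here purely for illustration.

```python
blocked = ['facebook', 'youtube']

# 1. Reproduce the skip: record which items the loop actually visits.
skipped_result = ['facebook.com', 'youtube.com', 'gold']
visited = []
for url in skipped_result:
    visited.append(url)
    if any(blocker in url for blocker in blocked):
        # After this removal 'youtube.com' slides into index 0,
        # but the loop continues at index 1, which is now 'gold'.
        skipped_result.remove(url)

print(visited)         # ['facebook.com', 'gold']
print(skipped_result)  # ['youtube.com', 'gold']

# 2. Workaround: loop over a copy, mutate the original.
fixed = ['facebook.com', 'youtube.com', 'gold']
for url in list(fixed):  # list(fixed) is a snapshot that never changes
    if any(blocker in url for blocker in blocked):
        fixed.remove(url)

print(fixed)           # ['gold']
```

The same snapshot idea applies to a set (loop over `list(urls)` and call `urls.discard(url)`), which avoids the "Set changed size during iteration" error, although building a new collection as shown above is usually simpler.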

Upvotes: 1

gold_cy

Reputation: 14226

We can extend your approach a bit further to achieve what you want solely using set operations.

found = set()
urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}

for url in urls:
    for blocker in blocked:
        if blocker in url:
            found.add(url)

# difference() returns a new set; urls itself is left unchanged
print(urls.difference(found))
# {'gold'}
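
If the scraper keeps feeding links into the same set and other code holds a reference to it, an in-place update may be preferable to creating a new set. A minimal sketch, assuming the same sample data as above; same_object is a hypothetical second reference added only to show that the original set object is the one being modified.

```python
urls = {'facebook.com', 'youtube.com', 'gold'}
blocked = {'facebook', 'youtube'}
same_object = urls  # a second reference to the very same set

# Collect the matches first, so urls is not modified while it is iterated.
found = {url for url in urls if any(blocker in url for blocker in blocked)}

# difference_update() removes the matches in place (same as: urls -= found).
urls.difference_update(found)

print(urls)                 # {'gold'}
print(same_object is urls)  # True
```

Only the (usually small) found set is allocated here, which may ease the memory concern raised in the question.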

Upvotes: 2
