Appending a list to another list while checking for duplicates

I need some help adding a list to another list while checking for duplicates. I only want to add items to my base list that are not already there.

I cannot do this using sets because the items in the base list are also lists.

An example of my base list is as follows:

toCrawl=[["http://website.html",0],["http://websiteAlt.html",1],["http://websiteAlt.html",1]]

The list that I want to add to this is as follows:

newLinks=["http://websiteAlt.html","http://websiteExample.html","http://websiteExampleAlt.html"]

So I want to add the 'newLinks' list to the base 'toCrawl' list, but only the items in newLinks that are not already in toCrawl.

I also want each item from 'newLinks' to be added to 'toCrawl' as a list. So rather than adding the item as "http://websiteExample.html", I want to add it as a list, for example: ["http://websiteExample.html",0]

Upvotes: 0

Views: 223

Answers (3)

A dictionary was a good shout, thanks. However, I ended up going with this method:

for link in newLinks:   # check every link in 'newLinks'
    if not any(item[0] == link for item in toCrawl):   # if no entry in 'toCrawl' has this URL...
        toCrawl.append([link, depthFound+1])   # add the link to 'toCrawl' with depth 'depthFound+1'

Upvotes: 0

scharette

Reputation: 10017

A nice solution would be to use a list comprehension and build a set from the URLs in your list:

toCrawl=[["http://website.html",0],["http://websiteAlt.html",1],["http://websiteAlt.html",1]]
existingLinks = set([item[0] for item in toCrawl])  # URLs already in 'toCrawl'
print(existingLinks)

Output

{'http://website.html', 'http://websiteAlt.html'}

Note that for removing duplicates, sets seem to be good practice. This is from the documentation:

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (For other containers see the built-in dict, list, and tuple classes, and the collections module.)
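
As a minimal sketch of how such a set of URLs could be used to finish the job: the helper name `add_new_links` and the default depth of 0 are assumptions (the depth is taken from the question's example), not something given in the answer above. The set holds only the URL strings, which are hashable, so the fact that the entries in toCrawl are lists is not a problem.

```python
def add_new_links(to_crawl, new_links, depth=0):
    """Append [url, depth] for each URL in new_links that is not yet in to_crawl."""
    seen = {url for url, _ in to_crawl}  # URLs only; strings are hashable
    for url in new_links:
        if url not in seen:
            to_crawl.append([url, depth])
            seen.add(url)  # also guards against repeats inside new_links
    return to_crawl


toCrawl = [["http://website.html", 0], ["http://websiteAlt.html", 1], ["http://websiteAlt.html", 1]]
newLinks = ["http://websiteAlt.html", "http://websiteExample.html", "http://websiteExampleAlt.html"]
add_new_links(toCrawl, newLinks)
print(toCrawl[-2:])
# [['http://websiteExample.html', 0], ['http://websiteExampleAlt.html', 0]]
```

Membership tests against a set take roughly constant time, so this should scale better than scanning toCrawl for every new link. Duplicates that were already in toCrawl are left untouched.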

Upvotes: 1

smolloy

Reputation: 368

Could this be done with a dictionary instead of a list?

toCrawlDict = dict(toCrawl)
for link in newLinks:
    if link not in toCrawlDict:
        toCrawlDict[link] = 0
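
If a list of lists is still needed afterwards, the dictionary can be turned back into one. The following is only a sketch under some assumptions: the starting depth of 0 comes from the question's example, and it relies on dictionaries preserving insertion order (Python 3.7+). Note that building the dictionary also collapses URLs that were already duplicated in toCrawl.

```python
toCrawl = [["http://website.html", 0], ["http://websiteAlt.html", 1], ["http://websiteAlt.html", 1]]
newLinks = ["http://websiteAlt.html", "http://websiteExample.html", "http://websiteExampleAlt.html"]

toCrawlDict = dict(toCrawl)  # each [url, depth] pair becomes a key/value entry
for link in newLinks:
    toCrawlDict.setdefault(link, 0)  # inserts depth 0 only if the URL is absent

# Convert back to the original [url, depth] list-of-lists shape.
toCrawl = [[url, depth] for url, depth in toCrawlDict.items()]
print(toCrawl)
# [['http://website.html', 0], ['http://websiteAlt.html', 1],
#  ['http://websiteExample.html', 0], ['http://websiteExampleAlt.html', 0]]
```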

Upvotes: 1
