Reputation: 23
I need some help adding a list to another list while checking for duplicates. I only want to add items to my base list that are not already there.
I cannot do this using sets because the items in the base list are also lists.
An example of my base list is as follows:
toCrawl=[["http://website.html",0],["http://websiteAlt.html",1],["http://websiteAlt.html",1]]
The list that I want to add to this is as follows:
newLinks=["http://websiteAlt.html","http://websiteExample.html","http://websiteExampleAlt.html"]
So I want to add the items of 'newLinks' to the base 'toCrawl' list, but only those items that are not already in 'toCrawl'.
I also want each item from 'newLinks' to go into 'toCrawl' as a list. So rather than adding the item as: "http://websiteExample.html"
I want to add it as a list, for example: ["http://websiteExample.html",0]
Upvotes: 0
Views: 223
Reputation: 23
A dictionary was a good shout, thanks. However, I ended up going with this method:
for link in newLinks:  # check every link in 'newLinks'
    if not any(link == entry[0] for entry in toCrawl):  # if no entry in 'toCrawl' has this URL...
        toCrawl.append([link, depthFound + 1])  # ...add the link to 'toCrawl' with a depth of 'depthFound' + 1
Upvotes: 0
Reputation: 10017
A nice solution would be to use a list comprehension and convert the result to a set:
toCrawl=[["http://website.html",0],["http://websiteAlt.html",1],["http://websiteAlt.html",1]]
newLinks = set([item[0] for item in toCrawl])
print(newLinks)
Output
{'http://website.html', 'http://websiteAlt.html'}
Note that sets seem to be good practice for removing duplicates. This is from the documentation:
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (For other containers see the built-in dict, list, and tuple classes, and the collections module.)
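To apply this to the original question, the set can act as a lookup of URLs already in toCrawl while the new links are appended. This is only a minimal sketch: it assumes new entries get depth 0, as in the question's example, and the names add_new_links and seen are made up here for illustration.

```python
def add_new_links(to_crawl, new_links, depth=0):
    """Append [link, depth] to to_crawl for each link whose URL is not already present."""
    # The entries are lists (unhashable), but the URLs inside them are strings,
    # so a set of URLs works fine for membership tests.
    seen = {entry[0] for entry in to_crawl}
    for link in new_links:
        if link not in seen:
            to_crawl.append([link, depth])
            seen.add(link)  # also guards against duplicates inside new_links
    return to_crawl


toCrawl = [["http://website.html", 0], ["http://websiteAlt.html", 1]]
newLinks = ["http://websiteAlt.html", "http://websiteExample.html", "http://websiteExampleAlt.html"]
add_new_links(toCrawl, newLinks)
print(toCrawl)
```

The set lookup keeps each membership test at roughly constant time, whereas scanning toCrawl for every link grows with the length of the list.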
Upvotes: 1
Reputation: 368
Could this be done with a dictionary instead of a list?
toCrawlDict = dict(toCrawl)
for link in newLinks:
    if link not in toCrawlDict:
        toCrawlDict[link] = 0
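If the rest of the code still expects a list of [url, depth] pairs, the dictionary can be converted back afterwards. Here is a rough sketch using the data from the question. It assumes Python 3.7+, where dicts keep insertion order, and note that dict(toCrawl) also collapses duplicate URLs already in the base list, keeping the last depth seen.

```python
toCrawl = [["http://website.html", 0], ["http://websiteAlt.html", 1], ["http://websiteAlt.html", 1]]
newLinks = ["http://websiteAlt.html", "http://websiteExample.html", "http://websiteExampleAlt.html"]

# Each inner [url, depth] list becomes one key/value pair.
toCrawlDict = dict(toCrawl)

for link in newLinks:
    # setdefault only inserts the key if it is missing, so existing depths are kept.
    toCrawlDict.setdefault(link, 0)

# Convert back to a list of [url, depth] lists.
toCrawl = [[url, depth] for url, depth in toCrawlDict.items()]
print(toCrawl)
```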
Upvotes: 1