Reputation: 1382
I have a list of URLs, file_url_list, which prints like this:
www.latimes.com, www.facebook.com, affinitweet.com, ...
and another list of the top 1M URLs, top_url_list, which prints like this:
[1, google.com], [2, www.google.com], [3, microsoft.com], ...
I want to find how many URLs in file_url_list are also in top_url_list. I have written the following code, which works, but I know it is neither the fastest nor the most Pythonic way to do it.
# Find the common occurrences
found = []
for file_item in file_url_list:
    for top_item in top_url_list:
        if file_item == top_item[1]:
            # When you find an occurrence, put it in a list
            found.append(top_item)
How can I write this in a more efficient and pythonic way?
Upvotes: 3
Views: 114
Reputation: 24133
You say you want to know how many URLs from the file are in the top-1M list, not which ones they actually are. Build a set from the larger list (I assume that will be the 1M list), then iterate over the other list, counting whether each URL is in the set:
top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)
If the file list is larger build the set from that instead:
file_urls = set(file_url_list)
total = sum(url in file_urls for index, url in top_url_list)
sum will add together numbers. url in top_urls evaluates to a bool, either True or False, which gets converted to an integer, 1 or 0 respectively. So url in top_urls for url in file_url_list effectively generates a sequence of 1s and 0s for sum.
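A quick illustration of the bool-to-int coercion described above (standalone values, not the question's data):

```python
# In Python, bool is a subclass of int: True behaves as 1, False as 0.
print(True + False + True)       # 2
print(sum([True, False, True]))  # 2
```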
Perhaps slightly more efficient (I'd have to test it), you could filter first and only sum 1s where url in top_urls:
total = sum(1 for url in file_url_list if url in top_urls)
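A minimal end-to-end sketch of this approach, using hypothetical sample data shaped like the question's lists (the [rank, url] pairs are made up):

```python
# Hypothetical sample data in the shapes described in the question.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'www.google.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Build the set from the larger list, then count membership hits.
top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)
print(total)  # only 'www.google.com' matches, so this prints 1
```

Membership tests against a set are O(1) on average, so this replaces the question's O(n*m) nested loop with a single O(n + m) pass.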
Upvotes: 2
Reputation: 21666
You could take the URLs from the second list and then either use a set, as Kos has shown in his answer, or use filter with a lambda.
top_url_list_flat = [item[1] for item in top_url_list]
print filter(lambda url: url in file_url_list, top_url_list_flat)
In Python 3, filter returns a lazy iterator rather than a list, so you will have to iterate over the result:
for common in filter(lambda url: url in file_url_list, top_url_list_flat):
    print(common)
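A Python 3 sketch of this answer, again with hypothetical sample data shaped like the question's lists:

```python
# Hypothetical sample data matching the question's shapes.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'www.google.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Flatten to just the URL column, then keep the URLs also present in the file list.
top_url_list_flat = [item[1] for item in top_url_list]
common = list(filter(lambda url: url in file_url_list, top_url_list_flat))
print(common)  # ['www.google.com']
```

Note that each `url in file_url_list` test scans the whole list; converting file_url_list to a set first would make each test O(1), as the other answers do.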
Upvotes: 1
Reputation: 72241
Set intersection should help. Additionally, you can use a generator expression to extract just the URL from each entry in top_url_list.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]
common_urls = set(file_url_list) & set(url for (index, url) in top_url_list)
or equivalently, using a set comprehension (thanks to Jean-François Fabre):
common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
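Putting it together with the sample data from this answer, len then gives the count the question asks for:

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'www.google.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

# Intersect the file URLs with the URL column of the top list.
common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
print(common_urls)       # {'www.google.com'}
print(len(common_urls))  # 1
```

Note that as a set, common_urls drops any duplicates from file_url_list; if duplicates should be counted separately, one of the sum-based approaches above is the better fit.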
Upvotes: 7