Aventinus

Reputation: 1382

How can I write the following code in a more efficient and pythonic way?

I have a list of URLs, file_url_list, which prints like this:

www.latimes.com, www.facebook.com, affinitweet.com, ...

And another list of the Top 1M URLs, top_url_list, which prints like this:

[1, google.com], [2, www.google.com], [3, microsoft.com], ...

I want to find how many URLs in file_url_list are in top_url_list. I have written the following code which works, but I know that it's not the fastest way to do it, nor the most pythonic one.

# Find the common occurrences
found = []
for file_item in file_url_list:
    for top_item in top_url_list:
        if file_item == top_item[1]:
            # When you find an occurrence, put it in a list
            found.append(top_item)

How can I write this in a more efficient and pythonic way?

Upvotes: 3

Views: 114

Answers (3)

Open AI - Opting Out

Reputation: 24133

You say you want to know how many URLs from the file are in the top 1M list, not which ones they actually are. Build a set from the larger list (I assume that will be the 1M list), then iterate through the other list, counting whether each url is in the set:

top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)

If the file list is larger build the set from that instead:

file_urls = set(file_url_list)
total = sum(url in file_urls for index, url in top_url_list)

sum adds numbers together. url in top_urls evaluates to a bool, either True or False, which gets converted to an integer, 1 or 0 respectively. So url in top_urls for url in file_url_list effectively generates a sequence of 1s and 0s for sum to add up.

Perhaps slightly more efficient (I'd have to test it): filter first, and sum a 1 only for each url that is in top_urls:

total = sum(1 for url in file_url_list if url in top_urls)
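For example, with invented sample data (the overlap below is made up for illustration; the real lists are far larger), the whole approach runs like this:

```python
# Sample data modelled on the question; the overlapping entries are invented.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.facebook.com'], [3, 'affinitweet.com']]

# Build the set once; each membership test is then O(1) on average.
top_urls = {url for (index, url) in top_url_list}
total = sum(url in top_urls for url in file_url_list)
print(total)  # 2 -- www.facebook.com and affinitweet.com are in the top list
```

Building the set once makes the whole count linear in the combined sizes of the two lists, instead of the quadratic nested loop in the question.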

Upvotes: 2

Chankey Pathak

Reputation: 21666

You could take the URLs from the second list and then either use a set, as Kos has shown in his answer, or use a lambda with filter.

top_url_list_flat = [item[1] for item in top_url_list]
print filter(lambda url: url in file_url_list, top_url_list_flat)

In Python 3, filter returns a lazy iterator rather than a list, so you will have to iterate over it, for example:

for common in (filter(lambda url: url in file_url_list, top_url_list_flat)):
    print (common)
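As a runnable Python 3 sketch (again with invented sample data, and forcing the iterator with list so the whole result prints at once):

```python
# Invented sample data; the overlapping entries are made up for illustration.
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.facebook.com'], [3, 'affinitweet.com']]

# Flatten to just the urls, then keep those that also appear in the file list.
top_url_list_flat = [item[1] for item in top_url_list]
common = list(filter(lambda url: url in file_url_list, top_url_list_flat))
print(common)  # ['www.facebook.com', 'affinitweet.com']
```

Note that url in file_url_list is a linear scan of a list, so for large inputs converting file_url_list to a set first would speed this up considerably.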

Upvotes: 1

Kos

Reputation: 72241

Set intersection should help. Additionally, you can use a generator expression to extract just the url from each entry in top_url_list.

file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
top_url_list = [[1, 'google.com'], [2, 'www.google.com'], [3, 'microsoft.com']]

common_urls = set(file_url_list) & set(url for (index, url) in top_url_list)

or equivalently thanks to Jean-François Fabre:

common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
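With the sample data from above (plus one overlapping entry invented so the intersection is non-empty), the count the question asks for is just the size of the intersection:

```python
file_url_list = ['www.latimes.com', 'www.facebook.com', 'affinitweet.com']
# Second entry changed to an invented overlap so the result is non-empty.
top_url_list = [[1, 'google.com'], [2, 'www.facebook.com'], [3, 'microsoft.com']]

# Intersect the file urls with a set comprehension over the top list.
common_urls = set(file_url_list) & {url for (index, url) in top_url_list}
print(common_urls)       # {'www.facebook.com'}
print(len(common_urls))  # 1
```

Set intersection also runs in roughly linear time, and unlike the counting approaches it keeps the matching URLs themselves, should you want them later.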

Upvotes: 7
