Removing characters from each item in a list and counting the same items

Question

I have a text file in which each row is an HTTP request. I created a list from the text file and am now trying to count how many times each domain sent a request. Each row contains the full URL, so I need to strip everything after ".com" to keep only the domain, then count the total number of requests made by each domain. For instance, based on the list below, the output would be:

'https:/news.com': 4
'https:/recipes.com': 4
'https:/books.com': 3

my_list = ['https:/news.com/main', 'https:/recipes.com/main', 
'https:/news.com/summary', 'https:/recipes.com/favorites', 
'https:/news.com/today', 'https:/recipes.com/book', 
'https:/news.com/register', 'https:/recipes.com/', 
'https:/books.com/main', 'https:/books.com/favorites', 
'https:/books.com/sale']

cs95 · Accepted Answer

You could do this using re and a Counter:

1. Extract domains with re.match
2. Pass the resulting generator expression to the Counter constructor

from collections import Counter
import re

# Lazy match with an escaped dot, so each match stops at the first ".com"
c = Counter(re.match(r'.*?\.com', i).group(0) for i in my_list)

print(c)
Counter({'https:/news.com': 4, 'https:/recipes.com': 4, 'https:/books.com': 3})
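
One thing to watch with a pattern like '.*com': the quantifier is greedy and the dot is unescaped, so the match runs to the last "com" in the string rather than the first ".com". A minimal sketch of the difference, using a hypothetical URL that is not in the original list:

```python
import re

# Hypothetical URL whose path also contains the letters "com"
url = 'https:/news.com/welcome'

# Greedy: backtracks from the end of the string to the LAST "com"
greedy = re.match('.*com', url).group(0)

# Lazy, with an escaped dot: stops at the FIRST literal ".com"
lazy = re.match(r'.*?\.com', url).group(0)

print(greedy)  # https:/news.com/welcom
print(lazy)    # https:/news.com
```

For the sample list in the question both patterns give the same counts, since none of its paths contain "com".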

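Note that this one-liner cannot handle errors: re.match returns None when a string does not match (for example, if your list contains an invalid URL), and calling .group(0) on None raises AttributeError, which a generator expression has no way to catch. In that case, consider using a loop: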

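r = []
for i in my_list:
    try:
        # Lazy match with an escaped dot stops at the first ".com"
        r.append(re.match(r'.*?\.com', i).group(0))
    except AttributeError:
        # re.match returned None (no ".com" in this row), so skip it
        pass

c = Counter(r)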
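
Putting the pieces together, here is a minimal sketch of the same idea wrapped in a function. The names count_domains and DOMAIN_RE and the sample rows are assumptions for illustration, not part of the original question. It checks for None explicitly instead of catching AttributeError, and strips the trailing newline that rows read from a file would carry.

```python
from collections import Counter
import re

# Lazy match up to and including the first literal ".com"
DOMAIN_RE = re.compile(r'.*?\.com')

def count_domains(rows):
    """Count requests per domain, skipping rows that contain no '.com'."""
    counts = Counter()
    for row in rows:
        m = DOMAIN_RE.match(row.strip())
        if m is not None:  # no match means the row is not a usable URL
            counts[m.group(0)] += 1
    return counts

# Hypothetical sample rows, shaped like lines read from a text file
rows = [
    'https:/news.com/main\n',
    'https:/recipes.com/main\n',
    'not a url\n',
    'https:/news.com/today\n',
]

counts = count_domains(rows)
# most_common() lists domains from most to fewest requests
for domain, n in counts.most_common():
    print(domain, n)
```

Because the function accepts any iterable of strings, an open file object should work in place of the list, for example inside `with open('requests.txt') as f:` where the file name is hypothetical.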
