Reputation: 13
I have a text file in which each row is an HTTP request. I created a list from the text file and am now trying to count how many requests each domain made. Each row contains the full URL, so I need to drop everything after ".com" to keep only the domain, then count the total number of requests for that domain. For instance, based on the list below, the output for books.com would be
'https:/books.com': 3
my_list = ['https:/news.com/main', 'https:/recipes.com/main',
'https:/news.com/summary', 'https:/recipes.com/favorites',
'https:/news.com/today', 'https:/recipes.com/book',
'https:/news.com/register', 'https:/recipes.com/',
'https:/books.com/main', 'https:/books.com/favorites',
'https:/books.com/sale']
Upvotes: 1
Views: 29
Reputation: 402932
You could do this using re and a Counter:
- re.match
- Counter constructor

from collections import Counter
import re
# Non-greedy match up to the first ".com", so a path such as "/welcome" is not swallowed
c = Counter(re.match(r'.*?\.com', i).group(0) for i in my_list)
print(c)
Counter({'https:/news.com': 4, 'https:/recipes.com': 4, 'https:/books.com': 3})
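To read off the figure for one domain, as in the question's expected output, the Counter can be indexed like a dict. A minimal sketch, assuming the same my_list as in the question and that every domain ends in ".com":

```python
from collections import Counter
import re

my_list = ['https:/news.com/main', 'https:/recipes.com/main',
           'https:/news.com/summary', 'https:/recipes.com/favorites',
           'https:/news.com/today', 'https:/recipes.com/book',
           'https:/news.com/register', 'https:/recipes.com/',
           'https:/books.com/main', 'https:/books.com/favorites',
           'https:/books.com/sale']

# Keep everything up to and including the first ".com", then count occurrences
c = Counter(re.match(r'.*?\.com', i).group(0) for i in my_list)

print(c['https:/books.com'])    # 3
print(c['https:/missing.com'])  # 0 (a Counter returns 0 for keys it has not seen)
```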
Note that re.match returns None when a string does not match, so calling .group(0) on it raises an AttributeError, and a (generator) comprehension has no way to catch that. If your list might contain an invalid URL, consider using a loop instead:
r = []
for i in my_list:
    try:
        r.append(re.match(r'.*?\.com', i).group(0))
    except AttributeError:
        # re.match returned None: skip entries that do not contain ".com"
        pass
c = Counter(r)
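If you would rather keep a generator expression, one alternative is to filter out the failed matches before calling .group(0). This is only a sketch: count_domains and DOMAIN are hypothetical names, and it assumes, as above, that every domain of interest ends in ".com".

```python
from collections import Counter
import re

# Hypothetical names; the pattern assumes the domain ends in ".com"
DOMAIN = re.compile(r'.*?\.com')

def count_domains(urls):
    """Count requests per domain, skipping strings that contain no ".com"."""
    matches = (DOMAIN.match(u) for u in urls)
    # re.match returns None on failure, so drop those before calling .group(0)
    return Counter(m.group(0) for m in matches if m is not None)

sample = ['https:/news.com/main', 'not a url',
          'https:/news.com/today', 'https:/books.com/sale']
counts = count_domains(sample)
print(counts)  # Counter({'https:/news.com': 2, 'https:/books.com': 1})
```

Invalid entries are dropped silently here, just as in the loop version; if you need to know which rows failed, collect them in a separate list instead.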
Upvotes: 1