Reputation: 41
[Task]
Write a program to read through a text file and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.
Example of a line of the text file:
"From [email protected] Sat Jan 5 09:14:16 2015"
Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
[Expected result]
04 3
06 1
07 1
09 2
10 3
11 6
14 1
15 2
16 4
17 2
18 1
19 1
This means that I need to pull out the "09:14:16" portion and then pull out the hour "09" once more.
I will use '#' to comment what I've done below
[My code]
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt" #if nothing is entered by user, it goes straight to the desired file
handle = open(name, 'r') # open and read the file
count = dict() # initialise count to an empty dictionary
for text in handle: #for loop to loop through lines in the file
    text = text.rstrip() #rstrip() to remove any newline "\n"
    if not text.startswith('From '): continue # find lines that start with "From "
    text = text.split() #split the line into list of words
    line = text[5] #time is located at the [5] index
    time = line.split(':') #split once more to get the hour
    hour = time[0] #hour is on the [0] index
    count[hour] = count.get(hour, 0) + 1
    print count
[My result]
{'09': 1} ← Mismatch
{'09': 1, '18': 1}
{'09': 1, '18': 1, '16': 1}
{'09': 1, '18': 1, '16': 1, '15': 1}
{'09': 1, '18': 1, '16': 1, '15': 2}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1, '11': 1}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1, '11': 2}
{'09': 1, '18': 1, '16': 1, '15': 2, '14': 1, '11': 3}
(deleted portion of the result)
{'09': 2, '18': 1, '16': 1, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1}
{'09': 2, '18': 1, '16': 1, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 1}
{'09': 2, '18': 1, '16': 1, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
{'09': 2, '18': 1, '16': 2, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
{'09': 2, '18': 1, '16': 3, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
{'09': 2, '18': 1, '16': 4, '15': 2, '14': 1, '11': 6, '10': 3, '07': 1, '06': 1, '04': 3, '19': 1, '17': 2}
Can someone help me see where I went wrong? Am I heading in the right direction? I appreciate any feedback and suggestions. I'm new to programming, so please be gentle, and sorry for any formatting errors.
Upvotes: 2
Views: 619
Reputation: 350
import re
import collections
name = raw_input("Enter file:")
if not name: name = "mbox-short.txt"
with open(name) as handle:
    hours = re.findall(r'^From .*(\d{2}):\d{2}:\d{2}', handle.read(), re.M)
count = sorted(collections.Counter(hours).items(), key=lambda x: int(x[0]))
for h, c in count:
    print h, c
Upvotes: 0
Reputation: 2274
Your problem is that you're printing a dictionary, and dictionaries in Python are not sorted by key (in Python 2 their order is arbitrary; newer versions keep insertion order, which still isn't key order, so it's a moot point).
You can solve this issue by sorting the dictionary keys before printing the results, as has been suggested. Personally though, I'm not sure it's the best solution.
The reason is that you're dealing with numbers. What's more, you're dealing with numbers in the [0, 23] range. This literally screams "use lists!" to me. :-)
So instead of using a dict(), try using:
# count = dict()
count = [0] * 24
This will create a list with 24 items, with indexes from 0 to 23.
Now, what you get from your string parsing are strings, so you'll need to convert them to integers before using them as list indexes:
# count[hour] = count.get(hour, 0) + 1
count[int(hour)] += 1
Note that an hour which cannot be converted to an integer, or which falls outside the 0..23 range, will work with a dict but fail with a pre-initialized list. This is actually good: code which receives bad input and uses it to generate bad output without complaint is poor code. Of course, code which just throws exceptions is not very good code either, but it's a step in the right direction.
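To make that concrete, here is a small sketch of how the list version reacts to bad hours. It is written to run on both Python 2 and 3; `record_hour` and the sample values are made up for illustration and are not part of the original program.

```python
def record_hour(count, hour):
    """Increment the counter for `hour` (a string such as '09').

    Hypothetical helper for illustration only.
    Returns None on success, or the name of the exception that was raised.
    """
    try:
        count[int(hour)] += 1
    except (ValueError, IndexError) as exc:
        return type(exc).__name__
    return None


count = [0] * 24
# '09' is valid, 'xx' is not a number, '24' is past the end of the list.
results = [record_hour(count, h) for h in ('09', 'xx', '24')]
print(results)
```

One caveat: a negative value such as '-1' would not fail, because Python treats negative indexes as counting from the end of the list. If that matters, an explicit range check is safer than relying on the exception.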
Of course, another issue arises: if you print a dict, both its keys and values are printed. If you print a list, only values are printed. So we need to change the output code to:
for hour, amount in enumerate(count):
    print hour, ':', amount
The next point I'd like to address in your code is: are you absolutely sure the sender field will contain no spaces? There's always a chance that your code will one day encounter a line like the following:
From "Bob Fisher" <[email protected]> Sat Jan 5 09:14:16 2015
Essentially, your string looks like its tail has a more regular and predictable format than its head. This means it would be more reliable to retrieve the time using slightly different syntax:
# line = text[5]
line = text[-2] # We take the 2nd element from the end of the list instead
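As a quick illustration of the difference, here is a sketch using an invented line (the display name and the address `bob@example.com` are placeholders, not data from your file):

```python
# Hypothetical sample line; the name and address are invented.
sample = 'From "Bob Fisher" <bob@example.com> Sat Jan 5 09:14:16 2015'
words = sample.split()

# Counting from the front breaks: the extra word in the name shifts every index.
from_front = words[5]
# Counting from the back still works as long as the line ends "<time> <year>".
from_back = words[-2]

print(from_front)
print(from_back)
```

Here `from_front` ends up holding the month instead of the time, while `from_back` is still the time string.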
It would probably be more generic to use a regular expression, but that's a more advanced topic which I'll leave uncovered here: if you know regexes, you'll be able to do it easily, and if you don't, you'll be better off with a proper introduction instead of whatever I'd be able to cobble together here.
Another nitpick: I notice that you're not closing your file handle. It's not a big issue here, since your program terminates anyway and any file handles that are still open will be closed automatically. In a larger project, however, this can lead to problems. Your code may be called by some other code, and if your code raises an exception that the caller handles or suppresses, the file handle may remain open. Repeat that enough times, and the program will exceed the OS limit on the number of open files.
So I would recommend using slightly different syntax to open the file:
with open(name, 'r') as handle:
    for text in handle:
        # ...
The advantage of this syntax is that 'with' will correctly close your file handle, no matter what happens in the code below it. Even if an exception occurs, the file will still be correctly closed.
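If you want to see this for yourself, the sketch below simulates a failure inside the 'with' block and then checks the handle. It uses a throwaway temporary file so it is self-contained; the error is raised on purpose and the variable names are only for illustration.

```python
import os
import tempfile

# Create a throwaway empty file so the sketch does not depend on mbox-short.txt.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    with open(path, 'r') as handle:
        # Simulate something going wrong while processing the file.
        raise RuntimeError('simulated failure while reading')
except RuntimeError:
    pass  # the caller swallows the error, as described above

# 'with' closed the file on the way out, even though an exception was raised.
closed_after_error = handle.closed
os.remove(path)
print(closed_after_error)
```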
And the code so far would look like:
name = raw_input("Enter file:")
if not name: name = "mbox-short.txt" # cleaner check for empty string
count = [0] * 24 # use pre-initialized list instead of dict
with open(name, 'r') as handle: # use safer syntax to open files
    for text in handle:
        text = text.rstrip()
        if not text.startswith('From '): continue
        text = text.split()
        line = text[-2] # use 2nd item from the end, just to be safe
        time = line.split(':')
        hour = int(time[0]) # we treat hour as integer
        count[hour] += 1 # nicer looking
for hour, amount in enumerate(count):
    if amount: # Only print hours with non-zero counters
        print hour, ':', amount
Now, there are ways to decrease its size at least by half (and probably more), but I've been trying to keep everything simple and true to the spirit of your original code.
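For the curious, one possible shorter form is sketched below. It swaps the pre-initialized list for `collections.Counter`, so it gives up the input validation discussed above, and it keeps the hours as zero-padded strings so that plain string sorting puts them in order. The function name `hours_histogram` and the sample lines are hypothetical; the code is written to run on both Python 2 and 3.

```python
import collections


def hours_histogram(lines):
    """Count 'From ' lines per hour; `lines` is any iterable of strings."""
    # split()[-2] is the "HH:MM:SS" field; its first two characters are the hour.
    hours = (line.split()[-2][:2] for line in lines if line.startswith('From '))
    return sorted(collections.Counter(hours).items())


# Hypothetical sample data standing in for the mbox file.
sample = [
    'From a@example.com Sat Jan 5 09:14:16 2015\n',
    'From: a@example.com\n',
    'From b@example.com Fri Jan 4 18:10:48 2015\n',
    'From c@example.com Fri Jan 4 09:05:31 2015\n',
]
for hour, amount in hours_histogram(sample):
    print('%s %d' % (hour, amount))
```

With a real file you would pass the open handle instead of the list, for example `hours_histogram(handle)` inside the 'with' block.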
Upvotes: 0
Reputation: 337
I think if you literally want that output, instead of "print count" at the end you need (outside the loop):
for a in sorted(count.keys()):
    print a,count[a]
Upvotes: 0
Reputation: 85512
Remove print count from inside the loop,
and at the end, outside the loop, add these lines:
for key in sorted(count.keys()):
    print key, count[key]
Upvotes: 1
Reputation: 5812
Since the datetime always has the same format, you can use a simple slicing trick:
your_string[-13:-11] # your hour
where your_string is the line you pasted (with the trailing newline stripped). Any line that ends with the full datetime in this format would be valid for this operation.
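A quick check of the slice on a sample line, as a sketch (the address is a placeholder). Note that both bounds are negative, so they count from the end of the line:

```python
# Sample line in the format from the question; the address is a placeholder.
line = 'From someone@example.com Sat Jan 5 09:14:16 2015\n'

# After stripping the newline, the line ends with "HH:MM:SS YYYY", which is
# 13 characters long, so the hour is the first two of those characters.
hour = line.rstrip()[-13:-11]
print(hour)
```

This only holds while the time and a four-digit year are the last things on the line, so it is more brittle than splitting on whitespace.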
Upvotes: 0