Reputation: 67
I'd like to analyze the chat log below to get the most frequently used words. The only parts I need are the messages that come after the [time] stamp, such as [01:25]. How should I change my code?
+++
John, Max, Tracey with SuperChats
Date Saved : 2019-11-22 19:29:46
--------------- Tuesday, 9 July 2019 ---------------
[John] [00:27] Hi
[Max] [01:25] No
[Tracey] [02:31] Anybody has some bananas?
[Max] [04:39] No
[John] [20:58] Oh my goodness
--------------- Wednesday, 10 July 2019 ---------------
[Tracey] [14:33] Anybody has a mug?
[Max] [14:45] No
[John] [14:45] Oh my buddha
+++
from collections import Counter
import re

wordDict = Counter()
with open(r'C:chatlog.txt', 'r', encoding='utf-8') as f:
    chatline = f.readlines()
chatline = [x.strip() for x in chatline]
chatline = [x for x in chatline if x]
for count in range(len(chatline)):
    if count < 2:
        continue
    elif '---------------' in chatline:
        continue
    re.split(r"\[\d{2}[:]\d{2}\]", x for x in chatline) #Maybe need to modify this part
print('Word', 'Frequency')
for word, freq in wordDict.most_common(50):
    print('{0:10s} : {1:3d}'.format(word, freq))
Upvotes: 2
Views: 117
Reputation: 56885
You can use the pattern /^.*?\[\d\d:\d\d\]\s*(.+)$/ to match the text after the timestamp on the relevant lines (I'd work line by line instead of slurping the file with f.readlines(), which isn't memory-friendly). There should be no need to specially handle anything else, since the timestamp format is distinctive enough that header and date lines won't match. It wouldn't hurt to also test for the brackets around the username at the beginning of the line if you wish.
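import re
from collections import Counter

words = []

with open("chatlog.txt", "r", encoding="utf-8") as f:
    # Iterate over the file lazily instead of loading it all at once.
    for line in f:
        # Capture everything after the [hh:mm] timestamp.
        m = re.search(r"^.*?\[\d\d:\d\d\]\s*(.+)$", line)

        if m:
            # Split the message on whitespace to get its words.
            words.extend(re.split(r"\s+", m.group(1)))

for word, freq in Counter(words).most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))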
Output:
No         :   3
Anybody    :   2
has        :   2
Oh         :   2
my         :   2
Hi         :   1
some       :   1
bananas?   :   1
goodness   :   1
a          :   1
mug?       :   1
buddha     :   1
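As the output shows, punctuation stays attached to words ("bananas?", "mug?"), so stripping it might also be worth doing. You could use something like:
# ...
        if m:
            # \W+ splits on runs of non-word characters; drop empty strings.
            no_punc = re.split(r"\W+", m.group(1))
            words.extend([x for x in no_punc if x])
# ...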
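Putting the pieces together, here is a minimal sketch rather than a definitive implementation. It assumes every message line looks like "[name] [hh:mm] text" and that names never contain "]". The helper name count_words and the LINE_RE pattern (which adds the optional username-bracket check mentioned earlier) are illustrative choices of mine, not part of the code above.

```python
import re
from collections import Counter

# Assumed line shape: "[name] [hh:mm] message". The leading "[name]" check is
# the optional extra test; names containing "]" are not handled.
LINE_RE = re.compile(r"^\[[^\]]+\]\s*\[\d\d:\d\d\]\s*(.+)$")


def count_words(lines):
    """Count words in the message part of each chat line (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m:
            # \W+ splits on runs of non-word characters, dropping punctuation;
            # empty strings left at the edges are filtered out.
            counts.update(w for w in re.split(r"\W+", m.group(1)) if w)
    return counts


sample = [
    "John, Max, Tracey with SuperChats",
    "Date Saved : 2019-11-22 19:29:46",
    "--------------- Tuesday, 9 July 2019 ---------------",
    "[John] [00:27] Hi",
    "[Max] [01:25] No",
    "[Tracey] [02:31] Anybody has some bananas?",
]

for word, freq in count_words(sample).most_common(3):
    print("{0:10s} : {1:3d}".format(word, freq))
```

Because count_words accepts any iterable of strings, an open file object can be passed directly, which keeps the line-by-line reading.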
Upvotes: 2
Reputation: 731
Try using split like this:
lines = ["[Tracey] [02:31] Anybody has some bananas?","[John] [20:58] Oh my goodness"]
for i in lines:
    # Drop the first two space-separated tokens (username and timestamp).
    print(i.split(' ')[2:])
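A small variation, sketched under the assumption that neither the username nor the timestamp contains a space: passing a maxsplit of 2 to str.split keeps the message as one string, which can then be split into words and counted. The helper name message_of is hypothetical.

```python
from collections import Counter


def message_of(line):
    """Return the text after "[name] [hh:mm] ", or "" if the line is too short."""
    parts = line.split(" ", 2)  # at most three pieces: name, time, message
    return parts[2] if len(parts) == 3 else ""


lines = [
    "[Tracey] [02:31] Anybody has some bananas?",
    "[John] [20:58] Oh my goodness",
]

counts = Counter()
for line in lines:
    # str.split() with no arguments splits on any run of whitespace.
    counts.update(message_of(line).split())

print(counts.most_common(3))
```

This approach never checks for the timestamp, so header and date-separator lines would still need to be filtered out before counting.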
Upvotes: 0