How to exclude or remove specific parts in Python

Question

I'd like to analyze the chat log below to find the most frequently used words, so the only parts I need are the messages that come after the [time] stamp, such as [01:25]. How should I change my code?

+++

John, Max, Tracey with SuperChats

Date Saved : 2019-11-22 19:29:46

--------------- Tuesday, 9 July 2019 ---------------

[John] [00:27] Hi

[Max] [01:25] No

[Tracey] [02:31] Anybody has some bananas?

[Max] [04:39] No

[John] [20:58] Oh my goodness

--------------- Wednesday, 10 July 2019 ---------------

[Tracey] [14:33] Anybody has a mug?

[Max] [14:45] No

[John] [14:45] Oh my buddha

+++

from collections import Counter
import re

wordDict = Counter()
with open(r'C:chatlog.txt', 'r', encoding='utf-8') as f:
    chatline = f.readlines()
    chatline = [x.strip() for x in chatline]
    chatline = [x for x in chatline if x]

    for count in range(len(chatline)):
        if count < 2:
            continue
        elif '---------------' in chatline:
            continue

        re.split(r"\[\d{2}[:]\d{2}\]", x for x in chatline) #Maybe need to modify this part

print('Word', 'Frequency')
for word, freq in wordDict.most_common(50):
    print('{0:10s} : {1:3d}'.format(word, freq))

ggorlen · Accepted Answer

You can use the pattern /^.*?\[\d\d:\d\d\]\s*(.+)$/ to capture the text that follows the timestamp on each chat line. I'd iterate over the file line by line instead of slurping it with f.readlines(), which isn't memory-friendly. There should be no need to handle the header and date-separator lines specially, since only chat lines contain a bracketed timestamp, but it wouldn't hurt to also test for the brackets around the username at the beginning of the line if you wish.
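
If you do want the stricter test on the username brackets, a pattern along the following lines might work. This is only a sketch: the names STRICT_LINE and message_of are made up for illustration, and it assumes every chat line has the exact shape "[username] [HH:MM] message" seen in the sample log.

```python
import re

# Hypothetical stricter pattern: "[username] [HH:MM] message".
# Assumes usernames never contain a closing bracket.
STRICT_LINE = re.compile(r"^\[[^\]]+\]\s*\[\d\d:\d\d\]\s*(.+)$")


def message_of(line):
    """Return the text after the timestamp, or None for non-chat lines."""
    m = STRICT_LINE.search(line.strip())
    return m.group(1) if m else None


# A chat line yields its message; header and separator lines yield None.
print(message_of("[Tracey] [02:31] Anybody has some bananas?"))
print(message_of("Date Saved : 2019-11-22 19:29:46"))
print(message_of("--------------- Tuesday, 9 July 2019 ---------------"))
```

The code that follows sticks with the simpler pattern, which is enough for the log shown in the question.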

import re
from collections import Counter

words = []

with open("chatlog.txt", "r", encoding="utf-8") as f:
    for line in f:
        m = re.search(r"^.*?\[\d\d:\d\d\]\s*(.+)$", line)

        if m:
            words.extend(re.split(r"\s+", m.group(1)))

for word, freq in Counter(words).most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))

Output:

No         :   3
Anybody    :   2
has        :   2
Oh         :   2
my         :   2
Hi         :   1
some       :   1
bananas?   :   1
goodness   :   1
a          :   1
mug?       :   1
buddha     :   1

As can be seen, stripping punctuation might also be worth doing. You could use something like

# ...
if m:
    no_punc = re.split(r"\W+", m.group(1))
    words.extend([x for x in no_punc if x])
# ...
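
Putting the two pieces together, a self-contained version might look like the sketch below. The function name count_words and the inline sample list are illustrative assumptions; in practice you would pass the open file object instead of the list.

```python
import re
from collections import Counter

# Same pattern as above: capture everything after the [HH:MM] timestamp.
TIMESTAMPED = re.compile(r"^.*?\[\d\d:\d\d\]\s*(.+)$")


def count_words(lines):
    """Count words that appear after the timestamp on each chat line."""
    counts = Counter()
    for line in lines:
        m = TIMESTAMPED.search(line)
        if m:
            # \W+ splits on runs of non-word characters; drop empty strings
            counts.update(w for w in re.split(r"\W+", m.group(1)) if w)
    return counts


# A few lines from the sample log, standing in for the real file.
sample = [
    "John, Max, Tracey with SuperChats\n",
    "--------------- Tuesday, 9 July 2019 ---------------\n",
    "[Tracey] [02:31] Anybody has some bananas?\n",
    "[Max] [04:39] No\n",
    "[Tracey] [14:33] Anybody has a mug?\n",
]

for word, freq in count_words(sample).most_common(3):
    print("{0:10s} : {1:3d}".format(word, freq))
```

Note that splitting on \W+ also breaks contractions apart ("don't" becomes "don" and "t"), so a different split may suit logs where that matters.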
