Reputation: 3

Counting a word with an apostrophe as one word, but regex returns it in two pieces (Python)

I am new to programming and I want to write a program that counts the frequency of words in a file. The expected output is as follows:

WORD FREQUENCY

in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x

I've almost got it working, but the word "list's" returns something like this:

list**â**  -  1
s  -  1

This affects the total number of words counted in the file.

I've been using regex like this:

match_pattern = re.findall(r"\w+", infile)

Upvotes: 0

Answers (3)

user11281370

Reputation:

This is a solution that does not use regex.

I am assuming there are multiple sentences in the file. Read the whole content into a string and call str.split(), which splits on whitespace by default. You will get a list of the words in that string.

Next, pass that list to collections.Counter to get a dictionary-like object whose keys are the words and whose values are their frequencies.

from collections import Counter

# Read the whole file into one string
with open('file.txt') as f:
    a = f.read()
# split() with no arguments splits on any run of whitespace
b = dict(Counter(a.split()))

b is a dictionary of word-frequency pairs.

Note: periods stay attached to the last word of each sentence. You can ignore them in the results, or remove all periods first and then follow the procedure above. Removing them also strips the '.' used in abbreviations, so it may not work the way you want.
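One possible way to handle the trailing periods, sketched under the assumption that trimming punctuation only from the ends of each token is acceptable. The sample text is made up for illustration; apostrophes inside a word are left alone.

```python
import string
from collections import Counter

# Hypothetical sample text standing in for the file contents
text = "A list's items stay together. Use a list."

# Split on whitespace, then trim punctuation from both ends of each token.
# Characters inside a token, such as the apostrophe in "list's", are kept.
words = [token.strip(string.punctuation) for token in text.split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

counts = Counter(words)
for word, freq in counts.items():
    print(f"{word} - {freq}")
```

Here "together." and "list." are counted as "together" and "list", while "list's" remains a single word.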

If you still want to use regex and match letters and apostrophe, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I get some time.
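A minimal sketch of the regex route, assuming the text has already been decoded correctly. The sample string is hypothetical, and the typographic apostrophe (’) is added to the character class as an assumption, since the expected output in the question uses it. The "â" in the question's output suggests the file may have been decoded with the wrong encoding, so opening it with encoding="utf-8" is probably needed as well.

```python
import re
from collections import Counter

# Hypothetical sample; for a real file, something like
# open("file.txt", encoding="utf-8").read() would supply the text.
text = "a type called list’s is a list's cousin"

# Letters plus the straight (') and typographic (’) apostrophes
pattern = r"[a-zA-Z'’]+"
words = re.findall(pattern, text)

counts = Counter(words)
for word, freq in counts.items():
    print(f"{word} - {freq}")
print(f"TOTAL = {len(words)}")
```

Both "list’s" and "list's" come back as single words, so the total is not inflated by a stray "s".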

Upvotes: 0

Emma

Reputation: 27733

I'm guessing that a simple expression with a defaultdict might work:

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match]+=1

print(words_dictionary)

Normal Output

['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']

defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})

Test with `lower()`

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()]+=1

print(words_dictionary)

Output with `lower()`

defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})

The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.

for key,value in words_dictionary.items():
    print(f'{key} - {value}')

Output

some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2

Upvotes: 1

user11116003

Reputation:

Instead of using:

match_pattern = re.findall(r"\w+", infile)

Try using:

match_pattern = re.findall(r"\S+", infile)

\w means a-z, A-Z, 0-9 and _ (in Python 3 it also matches other Unicode letters and digits by default).

\S means any non space character.
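A small comparison of the two patterns on a made-up sentence, as a sketch rather than a complete fix. It shows that \S+ keeps "list's" together but also keeps any punctuation that touches a word.

```python
import re

# Hypothetical sample text
text = "use a type called list's, or a list."

word_chars = re.findall(r"\w+", text)  # splits at the apostrophe
non_space = re.findall(r"\S+", text)   # splits only at whitespace

print(word_chars)
print(non_space)
```

With \S+, the tokens "list's," and "list." still carry their trailing comma and period, so some punctuation stripping may be wanted before counting.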

Upvotes: 0
