Miguel 2488

Reputation: 1440

Removing punctuation and creating a dictionary in Python

I'm trying to create a function that removes punctuation and lowercases every letter in a string. Then, it should return all this in the form of a dictionary that counts the word frequency in the string.

This is the code I wrote so far:

def word_dic(string):
    string = string.lower()
    new_string = string.split(' ')
    result = {}

    for key in new_string:
        if key in result:
            result[key] += 1
        else:
            result[key] = 1

    for c in result:
        "".join([ c if not c.isalpha() else "" for c in result])

    return result

But this is what I'm getting after executing it:

{'am': 3,
 'god!': 1,
 'god.': 1,
 'i': 2,
 'i?': 1,
 'thanks': 1,
 'to': 1,
 'who': 2}

I just need to remove the punctuation at the end of the words.

Upvotes: 0

Views: 2906

Answers (4)

R.R.C.

Reputation: 6711

If you want to reuse the words later, you can store each one in a sub-dictionary along with its occurrence count, so that every word has its own entry in the result. Removing punctuation only needs a small helper function of our own. See if the code below serves your needs:

def remove_punctuation(word):
    for c in word:
        if not c.isalpha():
            word = word.replace(c, '')
    return word


def word_dic(s):
    words = s.lower().split(' ')
    result = {}

    for word in words:
        word = remove_punctuation(word)

        if not result.get(word, None):
            result[word] = {
                'word': word,
                'ocurrences': 1,
            }
            continue
        result[word]['ocurrences'] += 1  

    return result


phrase = 'Who am I and who are you? Are we gods? Gods are we? We are what we are!'
print(word_dic(phrase))

and you'll have an output like this:

{ 'who': { 'word': 'who', 'ocurrences': 2}, 'am': { 'word': 'am', 'ocurrences': 1}, 'i': { 'word': 'i', 'ocurrences': 1}, 'and': { 'word': 'and', 'ocurrences': 1}, 'are': { 'word': 'are', 'ocurrences': 5}, 'you': { 'word': 'you', 'ocurrences': 1}, 'we': { 'word': 'we', 'ocurrences': 4}, 'gods': { 'word': 'gods', 'ocurrences': 2}, 'what': { 'word': 'what', 'ocurrences': 1} }

Then you can access each word and its occurrence count like this:

word_dic(phrase)['are']['word']       # output: are
word_dic(phrase)['are']['ocurrences'] # output: 5

Upvotes: 0

Olivier Melançon

Reputation: 22314

You can use string.punctuation to recognize punctuation and collections.Counter to count occurrences once the string has been split into words.

from collections import Counter
from string import punctuation

line = "It's a test and it's a good ol' one."

Counter(word.strip(punctuation) for word in line.casefold().split())
# Counter({"it's": 2, 'a': 2, 'test': 1, 'and': 1, 'good': 1, 'ol': 1, 'one': 1})

Using str.strip instead of str.replace preserves words such as It's, because only punctuation at the edges of each word is removed.

The method str.casefold is a more aggressive version of str.lower, intended for caseless matching.
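As a minimal sketch of both points, the snippet below compares stripping with removing punctuation everywhere, and casefold with lower. The sample strings are made up for illustration.

```python
from string import punctuation

# Hypothetical sample word with punctuation at both ends and inside.
word = "'it's!'"

# strip only removes punctuation from the two ends of the word.
stripped = word.strip(punctuation)

# Removing every punctuation character also drops the inner apostrophe.
removed = "".join(c for c in word if c not in punctuation)

print(stripped)  # it's
print(removed)   # its

# casefold also maps characters that lower leaves alone, such as the German eszett.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```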

Upvotes: 0

ShadowRanger

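Reputation: 155418

"".join([ c if not c.isalpha() else "" for c in result]) creates a new string, but it doesn't do anything with it; it's thrown away immediately, because you never store the result. Note also that it iterates over the whole keys of result rather than over the characters of each key, so it would not strip the punctuation even if you kept the value.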

Really, the best way to do this is to normalize your keys before counting them in result. For example, you might do:

for key in new_string:
    # Keep only the alphabetic parts of each key, and replace key for future use
    key = "".join([c for c in key if c.isalpha()])
    if key in result:
        result[key] += 1
    else:
        result[key] = 1

Now result never has keys with punctuation (and the counts for "god." and "god!" are summed under the key "god" alone), and there is no need for another pass to strip the punctuation after the fact.
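To make the difference concrete, here is a small sketch using a hypothetical list of words (not the asker's actual input). It shows that the original expression leaves the dictionary untouched, while normalizing each key before counting merges the variants.

```python
# Hypothetical sample data, assumed for illustration only.
words = ["god!", "god.", "am"]

# Counting first leaves the punctuation in the keys.
raw = {}
for key in words:
    raw[key] = raw.get(key, 0) + 1

# join returns a new string; raw itself is never modified.
# Here c is a whole key, so the keys containing punctuation are kept.
discarded = "".join([c if not c.isalpha() else "" for c in raw])

# Normalizing each key before counting merges "god!" and "god.".
clean = {}
for key in words:
    key = "".join([c for c in key if c.isalpha()])
    clean[key] = clean.get(key, 0) + 1

print(raw)        # {'god!': 1, 'god.': 1, 'am': 1}
print(discarded)  # god!god.
print(clean)      # {'god': 2, 'am': 1}
```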

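Alternatively, if you only care about leading and trailing punctuation on each word (so "it's" should be preserved as is, not converted to "its"), you can simplify a lot further. Import the string module (and rename the function's string parameter, for example to text, because it would otherwise shadow the module), then change: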

    key = "".join([c for c in key if c.isalpha()])

to:

    key = key.rstrip(string.punctuation)

This matches what you specifically asked for in your question (remove punctuation at the end of words, but not at the beginning or embedded within the word). To remove leading punctuation as well, use key.strip(string.punctuation) instead.
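Put together, the rstrip variant might look like the sketch below. The parameter name text, the use of split() without an argument, the skipping of empty keys, and the sample sentence are all assumptions made for this example rather than part of the original code.

```python
import string


def word_dic(text):
    """Count words after lowercasing and stripping trailing punctuation."""
    result = {}
    # split() with no argument also handles tabs, newlines and repeated spaces.
    for key in text.lower().split():
        key = key.rstrip(string.punctuation)
        if not key:
            # The token was punctuation only, so there is nothing to count.
            continue
        result[key] = result.get(key, 0) + 1
    return result


# Hypothetical sample sentence.
sample = "Thanks to God! Who am I? I am who I am, God."
print(word_dic(sample))
# {'thanks': 1, 'to': 1, 'god': 2, 'who': 2, 'am': 3, 'i': 3}
```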

Upvotes: 2

randomir

Reputation: 18697

Another option is to use Python's famous included batteries.

>>> import re
>>> from collections import Counter
>>> sentence = 'Is this a test? It could be!'
>>> Counter(re.sub(r'\W', ' ', sentence.lower()).split())
Counter({'a': 1, 'be': 1, 'this': 1, 'is': 1, 'it': 1, 'test': 1, 'could': 1})

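This leverages collections.Counter for counting words, and re.sub for replacing everything that's not a word character with a space. Note that \W treats digits and underscores as word characters and treats apostrophes as separators, so "it's" is counted as "it" and "s".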
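The sketch below illustrates how \W treats apostrophes, digits and underscores, using a made-up sentence. The second pattern is an assumed alternative, not part of the answer above: it keeps ASCII letters with an optional inner apostrophe, so it would not handle accented letters.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "It's a test. Isn't it? 2 good_words!"

# \W matches every non-word character, so apostrophes split words,
# while digits and underscores survive.
split_tokens = re.sub(r"\W", " ", text.lower()).split()
print(split_tokens)
# ['it', 's', 'a', 'test', 'isn', 't', 'it', '2', 'good_words']

# Assumed alternative: ASCII letters with one optional inner apostrophe.
kept_tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
print(kept_tokens)
# ["it's", 'a', 'test', "isn't", 'it', 'good', 'words']

counts = Counter(kept_tokens)
print(counts["it's"])  # 1
```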

Upvotes: 3
