Need assistance with cleaning words that were counted from a text file

Question

I have an input text file from which I have to count the total number of characters, the number of lines, and the number of occurrences of each word.

So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for the same word where one is in lower case and the other is in upper case.

Now, looking at the output, I realized that the word counts are not clean. I have been struggling to output clean data that does not count special characters and does not include a trailing period or comma as part of a word.

Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"

it should output:
2 Hello
2 Bob
1 I
1 am
1 to

Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *

Below is the code I have as of now.

# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()

# COUNT CHARACTERS
num_chars = len(fname)

# COUNT LINES 
num_lines = fname.count('\n')

#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:    
       d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
       d[w] = 1

# Add the sum of all the repeated words 
num_words = sum(d[w] for w in d)

lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count 
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()

# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))

print('\n The 30 most frequent words are \n')

# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s.  %4s %s' % (i, count, word))
    i += 1

Thanks

Dion Bridger · Accepted Answer

Try replacing

words = fname.split()

With

get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))

Let me explain the various parts of the code.

Starting with the first line, whenever you have a declaration of the form

function_name = lambda argument1, argument2, ..., argumentN: some_python_expression

What you're looking at is the definition of a function whose body is a single expression. A lambda can't contain statements such as assignments, so it can't rebind variables; it can only return the value of that expression.
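As a rough illustration, a lambda and an ordinary def are interchangeable when the function is a single expression. The names strip_period and strip_period_def below are made up for this sketch and are not part of the code above.

```python
# Hypothetical names, used only for illustration.
strip_period = lambda word: word.rstrip('.')

def strip_period_def(word):
    # Same behaviour as the lambda above, written as a regular function.
    return word.rstrip('.')

print(strip_period('bob.'))      # bob
print(strip_period_def('bob.'))  # bob
```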

So get_alphabetical_characters is a function that, as its name suggests, takes a word and returns only the alphabetical characters contained within it.

This is accomplished using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
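A small sketch of the join idiom on its own, with a made-up list of letters, may help. The string that .join is called on is the separator placed between the items.

```python
letters = ['b', 'o', 'b']

joined = "".join(letters)   # empty separator: plain concatenation
dashed = "-".join(letters)  # a dash is placed between the items

print(joined)  # bob
print(dashed)  # b-o-b
```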

And the some_list here is provided by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]

What this does is step through every character in the given word and put it into the list if it's alphabetical; if it isn't, it puts a blank string in its place.

For example

[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]

Evaluates to the following list:

['h','e','l','l','o','']

Which is then evaluated by

"".join(['h','e','l','l','o',''])

Which is equivalent to

'h'+'e'+'l'+'l'+'o'+''

Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again. And this in turn ultimately yields

"hello"

Hope that's clear!
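To tie this back to the sample line from the question, here is a minimal sketch of the whole cleaning step. Two things in it are additions and not part of the code above: a token such as "*" cleans down to an empty string, so empty strings are filtered out before counting, and collections.Counter stands in for the hand-written dictionary loop.

```python
from collections import Counter

text = "Hello, I am Bob. Hello to Bob *".lower()

get_alphabetical_characters = lambda word: "".join(
    [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])

cleaned = map(get_alphabetical_characters, text.split())

# "*" becomes '' after cleaning; drop empty strings so they are not counted.
# (This filter is an addition for the sketch.)
words = [w for w in cleaned if w]

# Counter replaces the manual "if w in d" loop from the question.
counts = Counter(words)
for word, count in counts.most_common():
    print(count, word)
```

With this input, hello and bob each get a count of 2, and i, am and to each get a count of 1.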

Edit #2: If you want to keep periods that are used as decimal points, we can write a function like this:

include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos].isspace() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(char for pos, char in enumerate(fname) if include_char(pos, fname)).split()

The include_char function checks whether the character at a given position is "alphanumeric" (i.e. a letter or a digit), is whitespace, or is a period whose preceding character is a digit. We use it to keep only the characters we want, join them into a single string, and then separate that string into a list of words using the str.split method. Whitespace has to be kept so that str.split still has word boundaries to split on.
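Here is a sketch of that idea as a regular function, run on a made-up sentence. It assumes whitespace should be kept so that str.split can still find word boundaries; the sample text is hypothetical.

```python
def include_char(pos, a_string):
    char = a_string[pos]
    # Keep letters, digits and whitespace (whitespace keeps the words separable).
    if char.isalnum() or char.isspace():
        return True
    # Keep a period only when the character just before it is a digit.
    # At pos == 0 the slice is empty, and ''.isdigit() is False.
    return char == '.' and a_string[pos - 1:pos].isdigit()

text = "the 3.5 litre engine costs 29.99, not 30."
kept = "".join(c for i, c in enumerate(text) if include_char(i, text))
words = kept.split()
print(words)
```

Note that a sentence-ending period directly after a number, as in "30." here, is also kept, because the rule only looks at the preceding character.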
