Reputation: 107
I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Now looking at the output I realized that, the count of words is not as clean. I have been struggling to output clean data where it does not count any special characters, and also when counting words not to include a period or a comma at the end of it.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
# if the word is repeated - start count
if w in d:
d[w] += 1
# if the word is only used once then give it a count of 1
else:
d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
print('%2s. %4s %s' % (i, count, word))
i += 1
Thanks
Upvotes: 0
Views: 237
Reputation: 66
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a function that doesn't have any side effects, meaning it can't change the value of variables, it can only return a value.
So get_alphabetical_characters
is a function that we know due to the suggestive name, that it takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list)
idiom which takes a list of strings and concatenates them (in other words, it producing a single string by joining them together in the given order).
And the some_list
here is provided by the generator expression [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is it steps through every character in the given word, and puts it into the list if it's alphebetical, or if it isn't it puts a blank string in it's place.
For example
[char if char in 'abcdefghijklmnopqrstuvwyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluates by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again. And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to include periods used to mark decimal we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos] == '.' and a_string[pos-1:pos].isdigit()
words = "".join(map(include_char, fname)).split()
What we're doing here is that the include_char
function checks if a character is "alphanumeric" (i.e. is a letter or a digit) or that it's a period and that the character preceding it is numeric, and using this function to strip out all the characters in the string we want, and joining them into a single string, which we then separate into a list of strings using the str.split
method.
Upvotes: 0
Reputation: 2349
This program may help you:
#I created a list of characters that I don't want \
# them to be considered as words!
char2remove = (".",",",";","!","?","*",":")
#Received an string of the user.
string = raw_input("Enter your string: ")
#Make all the letters lower-case
string = string.lower()
#replace the special characters with white-space.
for char in char2remove:
string = string.replace(char," ")
#Extract all the words in the new string (have repeats)
words = string.split(" ")
#creating a dictionary to remove repeats
to_count = dict()
for word in words:
to_count[word]=0
#counting the word repeats.
for word in to_count:
#if there is space in a word, it is white-space!
if word.isalpha():
print word, string.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is using Regex to remove all non-letter chars (to get rid off char2remove
list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
regex.sub(' ', your_str)
words = your_str.split(" ")
to_count = dict()
for word in words:
to_count[word]=0
for word in to_count:
if word.isalpha():
print word, your_str.count(word)
Upvotes: 0