Reputation: 29
I am writing a program that grabs a txt file off the internet and reads it. It then displays a bunch of data related to that txt file. Now, this all works well, until we get to the end. The last thing I want to do is display the top 10 most frequent words used in the txt file. The code I have right now only displays the most frequent word 10 times. Can someone look at this and tell me what the problem is? The only part you have to look at is the last part.
import urllib
open = urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt").read()
v = str(open) # this variable makes the file a string
strip = v.replace(" ", "") # this trims spaces
char = len(strip) # this variable counts the number of characters in the string
ch = v.splitlines() # this variable separates the lines
line = len(ch) # this counts the number of lines
print "Here's the number of lines in your file:", line
wordz = v.split()
print wordz
print "Here's the number of characters in your file:", char
spaces = v.count(' ')
words = ''.join(c if c.isalnum() else ' ' for c in v).split()
words = len(words)
print "Here's the number of words in your file:", words
topten = map(lambda x:filter(str.isalpha,x.lower()),v.split())
print "\n".join(sorted(words,key=words.count)[-10:][::-1])
Upvotes: 0
Views: 834
Reputation: 180411
Use collections.Counter to count all the words; Counter.most_common(10) will return the ten most common words and their counts:
wordz = v.split()
from collections import Counter
c = Counter(wordz)
print(c.most_common(10))
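The original line prints one word ten times because sorted(words, key=words.count) keeps every occurrence of every word, so the last items in the sorted list are all copies of the most frequent word. A minimal sketch of the difference, using a made-up sample string rather than the real file (the text and variable names here are illustrative only):

```python
from collections import Counter

# Hypothetical sample text standing in for the downloaded file.
sample = "the cat and the dog and the bird saw the cat"
words = sample.split()

# Sorting the full word list by frequency keeps duplicates,
# so the tail is just repeated copies of the most common word.
by_count = sorted(words, key=words.count)
last_three = by_count[-3:]

# Counter collapses duplicates: one entry per distinct word.
top_three = Counter(words).most_common(3)

print(last_three)
print(top_three)
```

Here last_three contains the same word three times, while top_three contains three distinct (word, count) pairs.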
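Using with to read the file line by line and count all the words (open() only accepts local paths, so this fetches the URL with urllib.urlopen as in the question, wrapped in contextlib.closing so it works in a with block):
from collections import Counter
from contextlib import closing
import urllib

with closing(urllib.urlopen("http://www.textfiles.com/etext/FICTION/alice30.txt")) as f:
    c = Counter()
    for line in f:
        c.update(line.split()) # Counter.update adds to the existing counts
print(c.most_common(10))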
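Note that line.split() treats "Alice", "alice" and "Alice," as three different words. If that matters, the words can be normalized before counting, in the spirit of the lower()/isalpha filter in the question. A rough sketch, assuming a short list of made-up lines as a stand-in for the file object (the sample lines and the helper name normalize are hypothetical):

```python
from collections import Counter

def normalize(word):
    # Lowercase and keep letters only, mirroring the question's
    # filter(str.isalpha, x.lower()) idea.
    return "".join(ch for ch in word.lower() if ch.isalpha())

# Hypothetical sample lines; a real file object can be iterated the same way.
lines = ["Alice was beginning to get very tired,", "said Alice. 'Tired of it!'"]

c = Counter()
for line in lines:
    cleaned = (normalize(w) for w in line.split())
    c.update(w for w in cleaned if w)  # skip tokens that contained no letters

print(c.most_common(2))
```

With this normalization "Alice" and "Alice." are counted under the single key "alice". Whether to drop digits and apostrophes as well is a design choice that depends on what should count as a word.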
To get the total number of characters in the file (excluding whitespace, since split() discards it), sum the length of each key multiplied by the number of times it appears:
print(sum(len(k)*v for k,v in c.items()))
To get the word count:
print(sum(c.values()))
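As a quick sanity check of the two sums above, here is a sketch on a small made-up string. It assumes that "characters" means non-whitespace characters, because split() throws the whitespace away before the Counter ever sees it:

```python
from collections import Counter

# Hypothetical sample text; not taken from the real file.
text = "the cat saw the other cat\nthe end"
c = Counter(text.split())

# Length of each distinct word times how often it occurs.
total_chars = sum(len(word) * n for word, n in c.items())
# Every occurrence of every word.
total_words = sum(c.values())

# The same figures computed directly from the raw text, for comparison.
non_space = sum(1 for ch in text if not ch.isspace())
direct_words = len(text.split())

print(total_chars, non_space)
print(total_words, direct_words)
```

Both pairs of numbers should agree, which suggests the Counter alone is enough to report the character and word totals without keeping the whole word list around.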
Upvotes: 2