Reputation:
I would like to count each word in a text file in Python.
My text is this:
Aragon is an autonomous community in northeastern Spain. The capital of Aragon is Zaragoza, which is also the most populous city in the autonomous community. Covering an area of 47720 km2, the region's terrain ranges from permanent glaciers through verdant valleys, rich pastures and orchards to the arid steppe plains of the central lowlands. Aragon is home to many rivers, most notably the Ebro, Spain's largest river, which flows west to east throughout the region through the province of Zaragoza. It is also home to the highest mountains in the Pyrenees.
I tried the following code:
from collections import Counter
file = open("data/aragon.txt", 'r')
wordcount = Counter(file.read().split())
for item in wordcount.items():
    print("{}\t{}".format(*item))
But the problem is that the output doesn't come out in any order. I would like the most frequent word at the top and the least frequent at the bottom, and I don't want entries with punctuation attached, like "Ebro," or "Spain.": no periods or commas, just the word.
How can I fix that?
Upvotes: 1
Views: 170
Reputation: 56
Use wordcount.most_common() instead of wordcount.items():
from collections import Counter
wordcount = Counter(file.read().split())
for item in wordcount.most_common(): print("{}\t{}".format(*item))
"on the other side and don't have any words like this: Ebro"
Actually, your text does contain that word.
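To make the effect of most_common() concrete, here is a minimal sketch on a short excerpt of the question's text (the excerpt and the variable names are only for illustration). It fixes the ordering, but the punctuation is still attached to some words, so that part needs a separate cleaning step.

```python
from collections import Counter

# Short excerpt from the question's text; a real script would read the file instead.
text = "Aragon is an autonomous community in northeastern Spain. The capital of Aragon is Zaragoza."

wordcount = Counter(text.split())

# most_common() returns (word, count) pairs ordered from highest to lowest count.
for word, count in wordcount.most_common():
    print("{}\t{}".format(word, count))

# Passing a number keeps only that many of the top entries.
top_two = wordcount.most_common(2)
```

Note that "Spain." and "Zaragoza." still appear with the trailing period, because split() only separates on whitespace.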
Upvotes: 0
Reputation: 1530
Code:
import re
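with open("t1.txt", "rt") as file:
    data = file.read()
# Strip non-alphanumeric characters from each token, lowercase it, then count "is"
words = [re.sub('[^a-zA-Z0-9]+', '', token).lower() for token in data.split(' ')]
print(words.count('is'))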
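The snippet above counts a single word. A possible way to extend the same cleaning to every word is to feed the cleaned tokens into a Counter, roughly as sketched below. The sample string stands in for the file contents, and dropping empty tokens is my own addition.

```python
import re
from collections import Counter

# Sample text standing in for the file contents.
data = "Aragon is an autonomous community. The capital of Aragon is Zaragoza, which is also the most populous city."

# Same cleaning idea: remove non-alphanumeric characters and lowercase each token.
cleaned = [re.sub('[^a-zA-Z0-9]+', '', token).lower() for token in data.split()]

# A token made only of punctuation would become an empty string, so skip those.
wordcount = Counter(token for token in cleaned if token)

for word, count in wordcount.most_common():
    print("{}\t{}".format(word, count))
```

Keep in mind that the character class is ASCII-only, so accented letters would be stripped from words as well.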
Upvotes: 0
Reputation: 3131
Maybe you can use a regex to match the words:
from collections import Counter
import re
wordcount = Counter(re.findall(r'\w+', file.read()))
for item in wordcount.most_common(): print("{}\t{}".format(*item))
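Two things worth knowing about \w+: it is case-sensitive, and it splits at apostrophes, so "region's" becomes "region" and "s". The sketch below shows both effects on a short sample. Lowercasing first and the alternative pattern [\w']+ are my own suggestions, not part of the code above.

```python
import re
from collections import Counter

# Short sample based on the question's text.
text = "The capital of Aragon is Zaragoza. Covering an area of 47720 km2, the region's terrain ranges widely."

# Lowercase first so that "The" and "the" are counted together.
words = re.findall(r"\w+", text.lower())
wordcount = Counter(words)

# Assumed alternative: allow apostrophes inside a word so "region's" stays whole.
words_with_apostrophes = re.findall(r"[\w']+", text.lower())

for word, count in wordcount.most_common():
    print("{}\t{}".format(word, count))
```

Whether "region's" should count as one word or two depends on what you need from the result.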
Upvotes: 0
Reputation: 1845
import re
from collections import Counter
file = open("data/aragon.txt", 'r')
txt = file.read()
text_punctuation = r"[\.,;:!\|\?\x0c\_\"'\{\}\[\]\(\)=\-+/<>]"
t = re.sub(text_punctuation, " ", txt)
wordcount = Counter(t.split(' '))
for item in wordcount.items(): print("{}\t{}".format(*item))
Output:
Aragon 3
is 5
an 2
autonomous 2
community 2
in 3
northeastern 1
Spain 2
11
The 1
capital 1
of 4
Zaragoza 2
which 2
also 2
the 10
most 2
populous 1
city 1
Covering 1
area 1
47720 1
km2 1
region 2
s 2
terrain 1
ranges 1
from 1
permanent 1
glaciers 1
through 2
verdant 1
valleys 1
rich 1
pastures 1
and 1
orchards 1
to 4
arid 1
steppe 1
plains 1
central 1
lowlands 1
home 2
many 1
rivers 1
notably 1
Ebro 1
largest 1
river 1
flows 1
west 1
east 1
throughout 1
province 1
It 1
highest 1
mountains 1
Pyrenees 1
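The blank entry with a count of 11 in the output above comes from split(' '): when a punctuation mark is replaced by a space next to an existing space, splitting on a single space produces empty strings. A small sketch of the difference, using one sentence from the text (calling split() without an argument is my suggestion here):

```python
import re
from collections import Counter

# One sentence from the question's text.
txt = "Aragon is home to many rivers, most notably the Ebro, Spain's largest river."
text_punctuation = r"[\.,;:!\|\?\x0c\_\"'\{\}\[\]\(\)=\-+/<>]"
t = re.sub(text_punctuation, " ", txt)

# Explicit separator: consecutive spaces produce empty strings that get counted.
with_empty = Counter(t.split(' '))

# No argument: any run of whitespace is treated as a single separator.
without_empty = Counter(t.split())

for word, count in without_empty.most_common():
    print("{}\t{}".format(word, count))
```

Using most_common() in the loop also gives the highest-to-lowest order asked for in the question.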
Upvotes: 0
Reputation: 36873
You might use the .isalnum method to check whether a character is a letter or a digit. For simplicity, I will use part of your text stored in a variable, as opposed to reading the whole text from the file:
import collections
aragon = "Aragon is an autonomous community in northeastern Spain. The capital of Aragon is Zaragoza, which is also the most populous city in the autonomous community. Covering an area of 47720 km2."
aragon_cleaned = ''.join(char if char.isalnum() else ' ' for char in aragon)
print(aragon_cleaned)
gives output
Aragon is an autonomous community in northeastern Spain The capital of Aragon is Zaragoza which is also the most populous city in the autonomous community Covering an area of 47720 km2
Observe that , and . were replaced by spaces. A double space is not a problem, because .split() splits at one or more whitespace characters. To get the elements from most popular to least popular, you might use the most_common method like so:
for word, count in collections.Counter(aragon_cleaned.split()).most_common():
    print(word, 'occurred', count, 'times')
gives output
is occurred 3 times
Aragon occurred 2 times
an occurred 2 times
autonomous occurred 2 times
community occurred 2 times
in occurred 2 times
of occurred 2 times
the occurred 2 times
northeastern occurred 1 times
Spain occurred 1 times
The occurred 1 times
capital occurred 1 times
Zaragoza occurred 1 times
which occurred 1 times
also occurred 1 times
most occurred 1 times
populous occurred 1 times
city occurred 1 times
Covering occurred 1 times
area occurred 1 times
47720 occurred 1 times
km2 occurred 1 times
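If you want to apply this to the file itself, the same idea could be wrapped in two small functions, for example as below. The function names, the utf-8 encoding and the lowercasing step (so that "The" and "the" are merged) are my assumptions, so adjust them to your needs.

```python
import collections


def count_words(text):
    """Return (word, count) pairs, most frequent first, ignoring punctuation and case."""
    # Replace every character that is not a letter or digit with a space.
    cleaned = ''.join(char if char.isalnum() else ' ' for char in text.lower())
    return collections.Counter(cleaned.split()).most_common()


def count_words_in_file(path, encoding="utf-8"):
    """Read the whole file at path and count its words (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return count_words(handle.read())
```

With the file from the question this would be used as: for word, count in count_words_in_file("data/aragon.txt"): print(word, count)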
Upvotes: 1