Count words without periods and commas in Python

I would like to count how often each word occurs in a text file in Python.

This is my text:

Aragon is an autonomous community in northeastern Spain. The capital of Aragon is Zaragoza, which is also the most populous city in the autonomous community. Covering an area of 47720 km2, the region's terrain ranges from permanent glaciers through verdant valleys, rich pastures and orchards to the arid steppe plains of the central lowlands. Aragon is home to many rivers, most notably the Ebro, Spain's largest river, which flows west to east throughout the region through the province of Zaragoza. It is also home to the highest mountains in the Pyrenees.

I wrote the following code:

file=open("data/aragon.txt",'r')
from collections import Counter
wordcount = Counter(file.read().split())
for item in wordcount.items(): print("{}\t{}".format(*item))

But the problem is that the output is not in any order. I would like the most frequent word at the top and the least frequent at the bottom, and I don't want entries like "Ebro," or "Spain." in the result: no periods or commas, just the words.

How can I fix that?

Upvotes: 1

Answers (5)

Abdullah Jannadi

Reputation: 56

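Use wordcount.most_common() instead of wordcount.items(). It returns the (word, count) pairs sorted from the highest count to the lowest.

from collections import Counter
with open("data/aragon.txt", "r") as file:
    wordcount = Counter(file.read().split())
# most_common() yields (word, count) pairs in descending order of count
for item in wordcount.most_common():
    print("{}\t{}".format(*item))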

"on the other side and don't have any words like this: Ebro"

Actually, you do have them.
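
To see the ordering on a small input, here is a minimal sketch. The sample string is made up for illustration and is not the asker's file. It also shows that plain split() only splits on whitespace, so tokens such as "Ebro," keep their comma.

```python
from collections import Counter

# Hypothetical sample text, chosen only to illustrate the behaviour
sample = "the river Ebro, the valley, the river"
wordcount = Counter(sample.split())

# most_common() returns (word, count) pairs sorted by descending count
ranked = wordcount.most_common()
for word, count in ranked:
    print("{}\t{}".format(word, count))

# most_common(n) limits the result to the n most frequent entries
top_two = wordcount.most_common(2)
```

Passing a number, as in most_common(2), is handy when only the top few words matter.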

Upvotes: 0

R. Baraiya

Reputation: 1530

Code:

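import re

# Read the whole file; the with block closes it afterwards
with open("t1.txt", "rt") as file:
    data = file.read()
# Strip non-alphanumeric characters from each token, lowercase it, then count "is"
words = [re.sub('[^a-zA-Z0-9]+', '', token).lower() for token in data.split(' ')]
print(words.count('is'))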

Upvotes: 0

Lucas M. Uriarte

Reputation: 3131

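You could use a regular expression to match only the words:

from collections import Counter
import re

with open("data/aragon.txt", "r") as file:
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped
    wordcount = Counter(re.findall(r'\w+', file.read()))
for item in wordcount.most_common():
    print("{}\t{}".format(*item))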
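
One thing to be aware of with \w+ is that an apostrophe also counts as a separator, so "Spain's" becomes "Spain" and "s". The sketch below uses one sentence from the question's text. The second pattern, which keeps apostrophes inside words, is just one possible variation and not part of the answer above.

```python
import re

# One sentence taken from the text in the question
sample = "Aragon is home to many rivers, most notably the Ebro, Spain's largest river."

# \w+ drops punctuation but also splits at the apostrophe
words = re.findall(r"\w+", sample)

# Assumed variation: allow apostrophes inside a word
words_apos = re.findall(r"[\w']+", sample)

print(words)
print(words_apos)
```

Which of the two patterns is preferable depends on whether possessives should count as separate words.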

Upvotes: 0

Will

Reputation: 1845

import re
from collections import Counter

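# Read the text; the original snippet used txt without defining it
with open("data/aragon.txt", 'r') as file:
    txt = file.read()
# Characters to replace with a space before splitting
text_punctuation = r"[\.,;:!\|\?\x0c\_\"'\{\}\[\]\(\)=\-+/<>]"
t = re.sub(text_punctuation, " ", txt)
wordcount = Counter(t.split(' '))
for item in wordcount.items():
    print("{}\t{}".format(*item))
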
Output:
Aragon  3
is  5
an  2
autonomous  2
community   2
in  3
northeastern    1
Spain   2
    11
The 1
capital 1
of  4
Zaragoza    2
which   2
also    2
the 10
most    2
populous    1
city    1
Covering    1
area    1
47720   1
km2 1
region  2
s   2
terrain 1
ranges  1
from    1
permanent   1
glaciers    1
through 2
verdant 1
valleys 1
rich    1
pastures    1
and 1
orchards    1
to  4
arid    1
steppe  1
plains  1
central 1
lowlands    1
home    2
many    1
rivers  1
notably 1
Ebro    1
largest 1
river   1
flows   1
west    1
east    1
throughout  1
province    1
It  1
highest 1
mountains   1
Pyrenees    1
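
The blank entry with a count of 11 in the output above is the empty string: split(' ') splits at every single space, so two spaces in a row produce an empty token. A small sketch of the difference follows, using a made-up sample string and a shortened punctuation pattern (both are assumptions for illustration).

```python
import re
from collections import Counter

# Hypothetical sample text and a reduced punctuation set
sample = "Aragon is in Spain. The capital is Zaragoza, a city."
cleaned = re.sub(r"[.,;:!?]", " ", sample)

# split(' ') keeps an empty string wherever two spaces are adjacent
with_empty = Counter(cleaned.split(" "))

# split() with no argument splits on runs of whitespace and drops empties
without_empty = Counter(cleaned.split())

for word, count in without_empty.most_common():
    print("{}\t{}".format(word, count))
```

Using split() together with most_common() gives a sorted result without the empty entry.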

Upvotes: 0

Daweo

Reputation: 36873

You might use the .isalnum method to check whether a character is a letter or a digit. For simplicity, I will use part of your text stored in a variable instead of reading the whole text from the file.

import collections
aragon = "Aragon is an autonomous community in northeastern Spain. The capital of Aragon is Zaragoza, which is also the most populous city in the autonomous community. Covering an area of 47720 km2."
aragon_cleaned = ''.join(char if char.isalnum() else ' ' for char in aragon)
print(aragon_cleaned)

gives output

Aragon is an autonomous community in northeastern Spain  The capital of Aragon is Zaragoza  which is also the most populous city in the autonomous community  Covering an area of 47720 km2

Observe that , and . were replaced by spaces. The double spaces are not a problem, because .split() splits on runs of one or more whitespace characters. To get the elements from most popular to least popular, you can use the most_common method like so:

for word, count in collections.Counter(aragon_cleaned.split()).most_common():
    print(word,'occurred',count,'times')

gives output

is occurred 3 times
Aragon occurred 2 times
an occurred 2 times
autonomous occurred 2 times
community occurred 2 times
in occurred 2 times
of occurred 2 times
the occurred 2 times
northeastern occurred 1 times
Spain occurred 1 times
The occurred 1 times
capital occurred 1 times
Zaragoza occurred 1 times
which occurred 1 times
also occurred 1 times
most occurred 1 times
populous occurred 1 times
city occurred 1 times
Covering occurred 1 times
area occurred 1 times
47720 occurred 1 times
km2 occurred 1 times
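
Note that "The" and "the" are counted separately in the output above. If they should be merged (an assumption, since the question does not say), one option is to lowercase each character while cleaning. A minimal sketch on one sentence from the text:

```python
import collections

# One sentence taken from the text in the question
text = "The capital of Aragon is Zaragoza, which is also the most populous city."

# Keep letters and digits (lowercased); replace everything else with a space
cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)

counts = collections.Counter(cleaned.split())
for word, count in counts.most_common():
    print(word, count)
```

The trade-off is that proper nouns such as "Aragon" are lowercased as well.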

Upvotes: 1
