bw61293

Reputation: 91

Python: Count how many times a word occurs in a file

I have a file that contains a city name and then a state name on each line. I am supposed to count how many times a state name occurs and return the value.

For example, if my file contained:

Los Angeles   California
San Diego     California
San Francisco California
Albany        New York
Buffalo       New York
Orlando       Florida

I am supposed to return how many times each state name occurs. I have this for California:

for line in f:
    California_count=line.find("California")
    if California_count!=-1:
        total=line.count("California")
print(total)

This only gives me the value 1, which I assume is because the word occurs once per line. How do I get it to return 3 instead of 1?

Upvotes: 4

Views: 30251

Answers (5)

Lachlan Moore

Reputation: 53

I believe the accepted answer covers what bw61293 asked for because of the format of his text file, but it is not a general solution for all text files.

The question title is 'Count how many times a word occurs in a file', yet the accepted answer counts the word 'California' at most once per line. If the word appears twice on a line, it is still counted only once. This works for the given format, but it is not a general solution if, say, the file were a book.

A fix to the accepted answer is below: it uses nltk to break each line into a list of words. Make sure to install the library first with 'pip install nltk' at the command prompt; beware, it's a big library. If you use Anaconda, run 'conda install -c anaconda nltk'. I used the TweetTokenizer because, among other reasons, apostrophes in words like "don't" can make other tokenizers split the word into pieces such as ['don', "'t"], whereas the TweetTokenizer returns ["don't"]. I also made the search case-insensitive by lowercasing every token before calling .count(). I hope this helps people who want a more general solution to the question of 'Count how many times a word occurs in a file'.

I am new to Stack Overflow, so please give feedback on improvements to my code or to what I have written in my first post ever!

UPDATE: I made an error; the code below is now fixed! (Keep in mind this is a case-insensitive search; if you want it case-sensitive, remove the .lower() from the list comprehension. Thanks.) I also promise to post an answer without nltk when I get enough time.

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

total = 0

with open('input.txt') as f:
    for line in f:
        # Split the line into tokens and lowercase them for a case-insensitive match.
        tokens = [token.lower() for token in tknzr.tokenize(line)]
        # list.count returns 0 when the word is absent, so no extra check is needed.
        total += tokens.count('california')

print(total)
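
For readers who would rather not install nltk, a rough standard-library sketch of the same idea follows. The regular expression and the helper name count_word are assumptions of this sketch, not part of the code above; the pattern keeps apostrophes inside words so that "don't" stays one token, but it is far cruder than the TweetTokenizer.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_word(lines, word):
    """Return how many times `word` occurs as a whole token (case-insensitive)."""
    counts = Counter()
    for line in lines:
        counts.update(token.lower() for token in WORD_RE.findall(line))
    return counts[word.lower()]


# Made-up sample lines; the second one mentions the word twice.
sample = [
    "Los Angeles   California",
    "California, here I come; don't forget California",
]
print(count_word(sample, "california"))
```

Because a file object is itself an iterable of lines, count_word(f, 'california') inside a with open('input.txt') as f: block should work the same way.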

Upvotes: 2

Bruno Gelb

Reputation: 5662

total = 0

with open('input.txt') as f:
    for line in f:
        # Count each line that mentions the state at most once,
        # including lines where the name sits at position 0.
        if 'California' in line:
            total += 1

print(total)

output:

3
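
Applied to the loop in the question, the underlying issue is that total is overwritten on every matching line instead of being accumulated. A minimal sketch of an accumulating version is below; io.StringIO is used only as a stand-in for the open file (an assumption made so the snippet runs on its own).

```python
import io

# Stand-in for the real file object; in practice use open('input.txt').
f = io.StringIO(
    "Los Angeles   California\n"
    "San Diego     California\n"
    "San Francisco California\n"
    "Albany        New York\n"
    "Buffalo       New York\n"
    "Orlando       Florida\n"
)

total = 0
for line in f:
    # str.count returns 0 when the substring is absent,
    # so no separate find() check is needed.
    total += line.count("California")

print(total)
```

Unlike the per-line check, this version also counts a name that appears more than once on the same line.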

Upvotes: 4

Denis

Reputation: 31

Alternatively, you could just use the re module, and regex it:

import re

states = """
Los Angeles   California
San Diego     California
San Francisco California
Albany        New York
Buffalo       New York
Orlando       Florida
"""

found = re.findall('[cC]alifornia', states)

# findall returns a list of every match, so its length is the count.
total = len(found)

print(total)
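
The character class in the pattern only covers the case of the first letter. If fully case-insensitive, whole-word matching is wanted, one possible variant (a sketch, not the code above) uses re.IGNORECASE together with \b word boundaries. The sample text is made up for illustration.

```python
import re

# Made-up sample text for illustration only.
text = "California, CALIFORNIA and california; but not Californian."

# \b anchors the match at word boundaries, so "Californian" is skipped.
matches = re.findall(r"\bcalifornia\b", text, flags=re.IGNORECASE)

print(len(matches))
```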

Upvotes: 3

Lily Mara

Reputation: 4138

Assuming that the spaces in your post are meant to be tabs, the following code will give you a dict containing the counts for all of the states in the file.

#!/usr/bin/env python3

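counts = {}
with open('states.txt', 'r') as statefile:
    for line in statefile:
        # The state is the second tab-separated field on each line.
        state = line.split('\t')[1].rstrip()
        if state not in counts:
            counts[state] = 0
        # Increment on every sighting, including the first one.
        counts[state] += 1
    print(counts)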

Upvotes: 1

m.wasowski

Reputation: 6386

Use a dictionary to store the counters:

data = """Los Angeles   California
San Diego     California
San Francisco California
Albany        New York
Buffalo       New York
Orlando       Florida""".splitlines()

counters = {}
for line in data:
    city, state = line[:14], line[14:]
    # city, state = line.split('\t')  # if separated by a tab
    if state not in counters:
        counters[state] = 1
    else:
        counters[state] += 1
print(counters)
# {'California': 3, 'New York': 2, 'Florida': 1}

You can simplify it by using collections.defaultdict:

from collections import defaultdict
counter = defaultdict(int)
for line in data:
    city, state = line[:14], line[14:]
    counter[state] += 1

print(counter)
# defaultdict(<class 'int'>, {'California': 3, 'New York': 2, 'Florida': 1})

Or use collections.Counter with a generator expression:

from collections import Counter
states = Counter(line[14:] for line in data)
# Counter({'California': 3, 'New York': 2, 'Florida': 1})
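
Slicing at column 14 relies on the sample's fixed-width layout. If the columns are neither aligned nor tab-separated, one hedged alternative is to match the end of each line against a known list of state names. KNOWN_STATES and count_states are hypothetical names for this sketch, and the list would need to cover every state in the real file.

```python
from collections import Counter

# Hypothetical list; extend it to every state expected in the file.
KNOWN_STATES = ["California", "New York", "Florida"]


def count_states(lines):
    """Count lines by the known state name each one ends with."""
    counts = Counter()
    for line in lines:
        stripped = line.rstrip()
        for state in KNOWN_STATES:
            if stripped.endswith(state):
                counts[state] += 1
                break  # one state per line
    return counts


# Made-up lines with inconsistent separators.
lines = [
    "Los Angeles California",
    "San Francisco   California",
    "Albany New York",
    "Orlando\tFlorida",
]
result = count_states(lines)
print(result.most_common())
```

If one state name is a suffix of another (for example "Virginia" and "West Virginia"), listing the longer name first avoids miscounting.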

Upvotes: 7
