Chaffy

Reputation: 182

Count the frequency of words at a given index in a file

I am trying to count the occurrences of the word at a specific index on each line of my file and print the result as a dictionary.

def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as file:
        content_of_file = file.readlines()
        dict_of_fruit_count = {}
        for line in content_of_file:
            line = line[0:-1]
            line = line.split("\t")
            for fruit in line:
                fruit = line[1]
                dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count


print(count_by_fruit())

Output: {'apple': 6, 'banana': 6, 'orange': 3}

I am getting this output, but it does not count the word frequencies correctly. After searching around, I could not find a proper solution. Could anyone help me identify my mistake?

My file has the following content (the data is tab-separated; I have written "\t" in the example because Stack Overflow alters the formatting):

  1. I am line one with \t apple \t from 2018
  2. I am line two with \t orange \t from 2017
  3. I am line three with \t apple \t from 2016
  4. I am line four with \t banana \t from 2010
  5. I am line five with \t banana \t from 1999

Upvotes: 1

Views: 100

Answers (2)

dawg

Reputation: 103694

You are looping too many times over the same line. Notice that every count you get is exactly 3 times what you expect.

Also, in Python you do not need to read the entire file into memory first. Just iterate over the file object line by line.

Try:

def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as f_in:
        dict_of_fruit_count = {}
        for line in f_in:
            fruit = line.split("\t")[1]
            dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
    return dict_of_fruit_count

Which can be further simplified to:

def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name) as f_in:
        dict_of_fruit_count = {}
        for fruit in (line.split('\t')[1] for line in f_in):
            dict_of_fruit_count[fruit] = dict_of_fruit_count.get(fruit, 0) + 1
        return dict_of_fruit_count 

Or, if you can use Counter:

from collections import Counter 

def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name) as f_in:
        return dict(Counter(line.split('\t')[1] for line in f_in))
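A quick way to check the Counter approach without touching the disk is to feed it an in-memory file. This is only a sketch: the `sample` object below is a hypothetical stand-in for the asker's file, rebuilt from the five example lines in the question with real tab characters.

```python
from collections import Counter
import io

# Hypothetical in-memory stand-in for the tab-separated file in the question.
sample = io.StringIO(
    "I am line one with\tapple\tfrom 2018\n"
    "I am line two with\torange\tfrom 2017\n"
    "I am line three with\tapple\tfrom 2016\n"
    "I am line four with\tbanana\tfrom 2010\n"
    "I am line five with\tbanana\tfrom 1999\n"
)

# Same generator expression as above: take field 1 of every line.
counts = Counter(line.split("\t")[1] for line in sample)

print(dict(counts))    # {'apple': 2, 'orange': 1, 'banana': 2}
print(counts["kiwi"])  # 0, a Counter returns 0 for missing keys
```

Wrapping the result in dict() is only needed if the caller requires a plain dictionary; a Counter already behaves like one.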

Upvotes: 1

Patrick Haugh

Reputation: 60944

The problem is the loop "for fruit in line:". Splitting each line on tabs produces three parts. Because you loop over those three parts and add one to the count on every pass, your counts end up 3 times as large as the actual data.
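The effect can be reproduced on a single line. The snippet below is a minimal sketch that uses one sample line from the question (with real tabs) and compares the original inner loop against indexing the field once; the names `buggy` and `fixed` are made up for illustration.

```python
# One line of the sample data, tab-separated into three fields.
line = "I am line one with\tapple\tfrom 2018\n"
parts = line[:-1].split("\t")

buggy = {}
for _ in parts:          # runs once per field, i.e. three times
    fruit = parts[1]     # always reads the same field
    buggy[fruit] = buggy.get(fruit, 0) + 1

fixed = {}
fruit = parts[1]         # index the field once, no inner loop
fixed[fruit] = fixed.get(fruit, 0) + 1

print(buggy)  # {'apple': 3}
print(fixed)  # {'apple': 1}
```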

Below is how I would write this function, using generator expressions and Counter.

from collections import Counter

def count_by_fruit(file_name="file_with_fruit_data.txt"):
    with open(file_name, "r") as file:
        lines = (line[:-1] for line in file)
        fruit = (line.split('\t')[1] for line in lines)
        return Counter(fruit)
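Two edge cases are worth keeping in mind with the versions above: line[:-1] removes the last character even when the final line has no trailing newline, and indexing [1] raises IndexError on a blank or short line. One possible variant, offered as a sketch rather than a drop-in replacement, lets the standard csv module do the splitting and skips rows that are too short. The function name `count_by_column` and its `column` parameter are hypothetical, and the .strip() call assumes stray spaces around the word should be ignored.

```python
import csv
from collections import Counter

def count_by_column(file_name, column=1):
    """Count the values in one tab-separated column, skipping short rows."""
    # newline="" is the mode the csv documentation recommends for csv.reader.
    with open(file_name, newline="") as f_in:
        # QUOTE_NONE treats quote characters as ordinary text (an assumption
        # about the data: fields are never quoted).
        rows = csv.reader(f_in, delimiter="\t", quoting=csv.QUOTE_NONE)
        return Counter(
            row[column].strip() for row in rows if len(row) > column
        )
```

Calling count_by_column("file_with_fruit_data.txt") would then play the same role as count_by_fruit() in the question.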

Upvotes: 1
