aaaabbbb

Reputation: 133

Python count and probability

I have the following data :

Name    Item
peter   apple
peter   apple
Ben     banana
peter   banana

I want to print

frequency of what peter eat :
apple 2 
banana 1 

this is my code

u, count = np.unique(data['Item'], return_counts=True)

process = u[np.where(data['Name']= 'peter')[0]]

process2 = dict(Counter(process))
print "Item\frequency"

for k, v in process2.items():
print '{0:.0f}\t{1}'.format(k,v)

but I get an error. I also want to calculate the probability that peter eats an apple next time, but I don't have any idea how. Any suggestions?

Upvotes: 3

Views: 3403

Answers (4)

Anand S Kumar

Reputation: 90889

As the other answer indicates, the error you are getting occurs because you cannot use data['Name'] = 'peter' as a function argument. What you actually intended is np.where(data['Name'] == 'peter').
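A minimal sketch of the corrected NumPy/Counter approach, assuming data is a pandas DataFrame holding the four rows from the question (the DataFrame construction here is an assumption for illustration). Note that it indexes the Item column directly: indexing the array of unique values with row positions, as the question's code does, would not select peter's rows.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Sample data copied from the question (construction is assumed).
data = pd.DataFrame({
    "Name": ["peter", "peter", "Ben", "peter"],
    "Item": ["apple", "apple", "banana", "banana"],
})

# `==` builds a boolean mask; `=` inside a call is not a comparison.
mask = (data["Name"] == "peter").to_numpy()

# Select peter's rows from the Item column itself,
# not from the array of unique values.
items = data["Item"].to_numpy()[np.where(mask)[0]]

counts = dict(Counter(items))
print("Item\tfrequency")
for item, freq in counts.items():
    print("{0}\t{1}".format(item, freq))
```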

But given that you are using pandas, I am guessing data is a pandas DataFrame. In that case, what you really want can be achieved using DataFrame.groupby. Example -

data[data['Name']=='peter'].groupby('Item').count()

Demo -

In [7]: data[data['Name']=='peter'].groupby('Item').count()
Out[7]:
        Name
Item
apple      2
banana     1

If you want this printed in a loop, you can use -

df = data[data['Name']=='peter'].groupby('Item').count()
for fruit,count in df['Name'].items():
    print('{0}\t{1}'.format(fruit,count))

Demo -

In [24]: df = data[data['Name']=='peter'].groupby('Item').count()

In [25]: for fruit,count in df['Name'].items():
   ....:     print('{0}\t{1}'.format(fruit,count))
   ....:
apple   2
banana  1

As for the updated issue, where the OP was getting the following error -

TypeError: invalid type comparison

This error occurs because, in the OP's real data, the column holds numeric values (float/int), but the OP was comparing them against a string. Example -

In [30]: df
Out[30]:
   0  1
0  1  2

In [31]: df[0]=='asd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-e7bacd79d320> in <module>()
----> 1 df[0]=='asd'

C:\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
    612
    613             # scalars
--> 614             res = na_op(values, other)
    615             if np.isscalar(res):
    616                 raise TypeError('Could not compare %s type with Series'

C:\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
    566                 result = getattr(x, name)(y)
    567                 if result is NotImplemented:
--> 568                     raise TypeError("invalid type comparison")
    569             except (AttributeError):
    570                 result = op(x, y)

TypeError: invalid type comparison

If your column is numeric, you should compare against numeric values, not a string.
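A short sketch of how one might check the column type before comparing, using a hypothetical one-row frame that mirrors the demo above. How a mismatched comparison behaves can differ between pandas versions, so checking the dtype first is the safer habit.

```python
import pandas as pd

# Hypothetical one-row frame mirroring the demo above.
df = pd.DataFrame({0: [1], 1: [2]})

# Inspect the column dtype before choosing what to compare against.
print(df[0].dtype)

# Compare a numeric column with a number...
numeric_mask = df[0] == 1

# ...or convert the column to strings first if the labels really are text.
string_mask = df[0].astype(str) == "1"

print(numeric_mask.tolist(), string_mask.tolist())
```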

Upvotes: 2

Andy Hayden

Reputation: 375415

You can groupby the name and use value_counts:

In [11]: df.groupby("Name")["Item"].value_counts()
Out[11]:
Name
Ben    banana    1
peter  apple     2
       banana    1
dtype: int64

Potentially you could unstack these into columns:

In [12]: df.groupby("Name")["Item"].value_counts().unstack(1)
Out[12]:
       apple  banana
Name
Ben      NaN       1
peter      2       1

In [13]: res = df.groupby("Name")["Item"].value_counts().unstack(1).fillna(0)

In [13]: res
Out[13]:
       apple  banana
Name
Ben        0       1
peter      2       1

To get the probabilities divide by the sum:

In [14]: res = res.div(res.sum(axis=1), axis=0)

In [15]: res
Out[15]:
          apple    banana
Name
Ben    0.000000  1.000000
peter  0.666667  0.333333

and the probability peter eats an apple next time:

In [16]: res.loc["peter", "apple"]
Out[16]: 0.66666666666666663
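As an alternative sketch (not the method used above), pd.crosstab with normalize="index" should produce the same per-person proportions in one step. The DataFrame below is rebuilt from the question's four rows, and reading the proportion as the chance of peter's next choice assumes his past choices are representative.

```python
import pandas as pd

# Sample data copied from the question (construction is assumed).
df = pd.DataFrame({
    "Name": ["peter", "peter", "Ben", "peter"],
    "Item": ["apple", "apple", "banana", "banana"],
})

# Count Name/Item pairs and normalise each row so it sums to 1.
probs = pd.crosstab(df["Name"], df["Item"], normalize="index")
print(probs)

# Empirical proportion of apples among peter's items.
p_apple = probs.loc["peter", "apple"]
print(p_apple)
```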

Upvotes: 2

inspectorG4dget

Reputation: 113905

If you're not dead set on using numpy:

import collections
import csv

data = collections.defaultdict(lambda: collections.defaultdict(int))
with open('path/to/file') as infile:
    infile.readline()  # get rid of the header
    for name, food in csv.reader(infile):  # assumes two comma-separated columns
        data[name][food] += 1

for name, d in data.items():  # .items() replaces the Python 2-only .iteritems()
    print("frequency of what", name, "ate:")
    total = float(sum(d.values()))
    for food, count in d.items():
        print(food, count, "probability:", count/total)

Upvotes: 0

Jacob Ritchie

Reputation: 1401

I'm not super familiar with Pandas or NumPy, but one problem I can see is that:

data['Name'] = 'peter'

is an assignment statement.

Whereas you probably want to check for equality:

data['Name'] == 'peter'

Also, unless your indentation was messed up when pasting your code here, you need to indent the body of your for statement, or you'll hit another error once you've cleared up this one.

for k, v in process2.items():
    print '{0:.0f}\t{1}'.format(k,v)

Upvotes: 0
