Reputation: 133
I have the following data :
Name Item
peter apple
peter apple
Ben banana
peter banana
I want to print the frequency of what peter eats:
apple 2
banana 1
This is my code:
u, count = np.unique(data['Item'], return_counts=True)
process = u[np.where(data['Name']= 'peter')[0]]
process2 = dict(Counter(process))
print "Item\frequency"
for k, v in process2.items():
print '{0:.0f}\t{1}'.format(k,v)
But it raises an error. I also want to calculate the probability that peter eats an apple next time, but I don't have any idea how. Any suggestions?
Upvotes: 3
Views: 3403
Reputation: 90889
The error you are getting is, as the other answer indicates, because you cannot use data['Name'] = 'peter' as a function argument. You actually intended to use np.where(data['Name'] == 'peter').
But given that you are using pandas, I am guessing data is a pandas DataFrame. In that case, what you really want can be achieved using DataFrame.groupby. Example -
data[data['Name']=='peter'].groupby('Item').count()
Demo -
In [7]: data[data['Name']=='peter'].groupby('Item').count()
Out[7]:
        Name
Item
apple      2
banana     1
If you want this printed in a loop, you can use -
df = data[data['Name']=='peter'].groupby('Item').count()
for fruit,count in df['Name'].items():
    print('{0}\t{1}'.format(fruit,count))
Demo -
In [24]: df = data[data['Name']=='peter'].groupby('Item').count()
In [25]: for fruit,count in df['Name'].items():
   ....:     print('{0}\t{1}'.format(fruit,count))
   ....:
apple 2
banana 1
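For anyone who wants to run this without the original file, here is a minimal self-contained sketch. It assumes the question's four rows are the whole dataset and rebuilds data by hand. It uses .size() in place of .count() so the result is a Series directly, and peter_counts is just an illustrative name.

```python
import pandas as pd

# Assumption: the question's four rows, rebuilt by hand.
data = pd.DataFrame({
    'Name': ['peter', 'peter', 'Ben', 'peter'],
    'Item': ['apple', 'apple', 'banana', 'banana'],
})

# Keep only peter's rows, then count the rows in each Item group.
peter_counts = data[data['Name'] == 'peter'].groupby('Item').size()

# Series.items() yields (index label, value) pairs.
for fruit, count in peter_counts.items():
    print('{0}\t{1}'.format(fruit, count))
```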
For the updated issue, where the OP was getting the following error -
TypeError: invalid type comparison
The issue occurs because in the OP's real data the column holds numeric values (float/int), but the OP was comparing them against a string, hence the error. Example -
In [30]: df
Out[30]:
   0  1
0  1  2
In [31]: df[0]=='asd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-e7bacd79d320> in <module>()
----> 1 df[0]=='asd'
C:\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
612
613 # scalars
--> 614 res = na_op(values, other)
615 if np.isscalar(res):
616 raise TypeError('Could not compare %s type with Series'
C:\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
566 result = getattr(x, name)(y)
567 if result is NotImplemented:
--> 568 raise TypeError("invalid type comparison")
569 except (AttributeError):
570 result = op(x, y)
TypeError: invalid type comparison
If your column is numeric, you should compare against numeric values, not string.
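One way to catch this early is to look at the column's dtype before comparing. The sketch below uses a made-up one-row frame like the one in the demo. Newer pandas versions may return an all-False result for a mismatched comparison instead of raising, so checking the dtype is probably the more reliable signal.

```python
import pandas as pd

# Assumption: a made-up frame with integer columns, as in the demo.
df = pd.DataFrame([[1, 2]])

# Check the dtype first: if it is numeric, compare against numbers, not strings.
is_numeric = pd.api.types.is_numeric_dtype(df[0])
print(df[0].dtype, is_numeric)

# Compare the numeric column against a numeric value.
mask = df[0] == 1
print(df[mask])
```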
Upvotes: 2
Reputation: 375415
You can groupby the name and use value_counts:
In [11]: df.groupby("Name")["Item"].value_counts()
Out[11]:
Name
Ben    banana    1
peter  apple     2
       banana    1
dtype: int64
Potentially you could unstack these into columns:
In [12]: df.groupby("Name")["Item"].value_counts().unstack(1)
Out[12]:
       apple  banana
Name
Ben      NaN       1
peter      2       1
In [13]: res = df.groupby("Name")["Item"].value_counts().unstack(1).fillna(0)
In [13]: res
Out[13]:
       apple  banana
Name
Ben        0       1
peter      2       1
To get the probabilities divide by the sum:
In [14]: res = res.div(res.sum(axis=1), axis=0)
In [15]: res
Out[15]:
          apple    banana
Name
Ben    0.000000  1.000000
peter  0.666667  0.333333
and the probability peter eats an apple next time:
In [16]: res.loc["peter", "apple"]
Out[16]: 0.66666666666666663
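If only one person's probabilities are needed, a shorter route is value_counts(normalize=True), which returns relative frequencies directly. This is a different technique from the unstack/div steps above, sketched here on the question's four rows (rebuilt by hand, so an assumption). The result is the observed frequency, which is only an estimate of what peter eats next.

```python
import pandas as pd

# Assumption: the question's four rows, rebuilt by hand.
df = pd.DataFrame({
    'Name': ['peter', 'peter', 'Ben', 'peter'],
    'Item': ['apple', 'apple', 'banana', 'banana'],
})

# Select peter's items, then turn the counts into relative frequencies.
peter_probs = df.loc[df['Name'] == 'peter', 'Item'].value_counts(normalize=True)
print(peter_probs)

# Estimated probability that peter eats an apple next time.
p_apple = peter_probs['apple']
print(p_apple)
```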
Upvotes: 2
Reputation: 113905
If you're not dead set on using numpy:
import collections
import csv

# Nested counter: data[name][food] is how many times name ate food
data = collections.defaultdict(lambda: collections.defaultdict(int))
with open('path/to/file') as infile:
    infile.readline()  # get rid of the header
    for name, food in csv.reader(infile):
        data[name][food] += 1
for name, d in data.items():
    print("frequency of what", name, "ate:")
    total = float(sum(d.values()))
    for food, count in d.items():
        print(food, count, "probability:", count/total)
Upvotes: 0
Reputation: 1401
I'm not super familiar with Pandas or NumPy, but one problem I can see is that:
data['Name'] = 'peter'
is an assignment statement.
Whereas you probably want to check for equality:
data['Name'] == 'peter'
Also, unless your indentation was messed up in pasting your code here, you need to indent the body of your for statement, or you'll hit another error once you've cleared up this one.
for k, v in process2.items():
    print '{0:.0f}\t{1}'.format(k,v)
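Putting both fixes together, the question's NumPy approach could look like the sketch below. It is written for Python 3 and assumes the two columns are available as plain arrays (rebuilt by hand here, with illustrative names). It also applies the mask before calling np.unique, since indexing the unique values with row positions would not give peter's items.

```python
import numpy as np

# Assumption: the question's two columns as plain arrays, rebuilt by hand.
names = np.array(['peter', 'peter', 'Ben', 'peter'])
items = np.array(['apple', 'apple', 'banana', 'banana'])

# == builds a boolean mask; a single = is assignment and is not valid here.
peter_items = items[names == 'peter']

# Count each distinct item among peter's rows only.
u, count = np.unique(peter_items, return_counts=True)
freq = dict(zip(u.tolist(), count.tolist()))

print('Item\tfrequency')
for k, v in freq.items():
    # The loop body is indented under the for statement.
    print('{0}\t{1}'.format(k, v))
```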
Upvotes: 0