Karthik Apadodharanan

Reputation: 95

Clustering similar values in a DataFrame based on averages

I have a DataFrame with zone-wise sales records and need to cluster the rows based on average sales:

Zone         Consumption
North          1
South          3
East           10
North          8
North2         0
South          5

I used the code below:

def Clustering(row):
    if row['Consumption']<.5*np.mean(['Consumption']):
        val='E'
    elif row['Consumption']<.75*np.mean(['Consumption']):
        val='D'
    elif row['Consumption']<1*np.mean(['Consumption']):
        val='C'
    elif row['Consumption']<1.5*np.mean(['Consumption']):
        val='B'
    elif row['Consumption']<2.5*np.mean(['Consumption']):
        val='A'
    else:
        val='Z'
    return val

Traceback

<ipython-input-21-f08d8263edc0> in Clustering(row)
      1 def Clustering(row):
----> 2     if row['Consumption']<.5*np.mean(['Consumption']):
      3         val='E'
      4     elif row['Consumption']<.75*np.mean(['Consumption']):
      5         val='D'

<__array_function__ internals> in mean(*args, **kwargs)

~\anaconda3\lib\site-packages\numpy\core\fromnumeric.py in mean(a, axis, dtype, out, keepdims)
   3333 
   3334     return _methods._mean(a, axis=axis, dtype=dtype,
-> 3335                           out=out, **kwargs)
   3336 
   3337 

~\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
    149             is_float16_result = True
    150 
--> 151     ret = umr_sum(arr, axis, dtype, out, keepdims)
    152     if isinstance(ret, mu.ndarray):
    153         ret = um.true_divide(

TypeError: cannot perform reduce with flexible type

My assumption was that the error is caused by the sales column containing some str values, but that isn't the case. How should I go about fixing this?

Upvotes: 0

Views: 127

Answers (1)

Code Different

Reputation: 93181

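The error comes from np.mean(['Consumption']): that call averages a list containing the string 'Consumption', not the column, so NumPy has no numbers to reduce. Have you tried pd.cut instead? Assuming df['Consumption'].mean() > 0 (the bin edges must be strictly increasing):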

# Define the bins, open-ended on both sides with -inf and inf
bins = np.array([.5, .75, 1, 1.5, 2.5]) * df['Consumption'].mean()
# np.NINF was removed in NumPy 2.0; -np.inf works on all versions
bins = np.hstack((-np.inf, bins, np.inf))

df['Cluster'] = pd.cut(df['Consumption'], bins, labels=list('EDCBAZ')).astype('str')
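
For comparison, here is a minimal sketch on the sample data from the question. It runs the row-wise function with the mean taken from df['Consumption'] (rather than from the list ['Consumption']) next to the pd.cut version. The lowercase name clustering, the mean argument, and the Cluster_apply / Cluster_cut column names are my own choices for illustration. I also pass right=False, on the assumption that you want the bins to behave exactly like the strict < comparisons in your function; with the default right=True, a value sitting exactly on a threshold falls into the lower bin.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Zone": ["North", "South", "East", "North", "North2", "South"],
    "Consumption": [1, 3, 10, 8, 0, 5],
})

# Compute the mean once, from the actual column (4.5 for this data)
mean_consumption = df["Consumption"].mean()


def clustering(row, mean):
    """Row-wise version of the question's function, with the mean passed in."""
    value = row["Consumption"]
    if value < 0.5 * mean:
        return "E"
    elif value < 0.75 * mean:
        return "D"
    elif value < 1 * mean:
        return "C"
    elif value < 1.5 * mean:
        return "B"
    elif value < 2.5 * mean:
        return "A"
    return "Z"


# Row-wise: simple, but slow on large frames
df["Cluster_apply"] = df.apply(clustering, axis=1, args=(mean_consumption,))

# Vectorized: same thresholds expressed as bin edges
edges = np.array([0.5, 0.75, 1, 1.5, 2.5]) * mean_consumption
bins = np.concatenate(([-np.inf], edges, [np.inf]))
df["Cluster_cut"] = pd.cut(
    df["Consumption"], bins, labels=list("EDCBAZ"), right=False
).astype(str)

print(df)
```

Both columns should come out identical for this data, so either approach works; pd.cut is usually the better fit once the frame gets large.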

Upvotes: 1
