Reputation:

Create a new column with frequency-based categories in pandas

I would need to create a new column as follows:

if the frequency of an item is greater or equal than 5 then set 'best seller';
if the frequency of an item is between 2 (inclusive) and 5 then set 'ok';
if the frequency of an item is lower than 2 then set 'bad'.

Suppose that my dataset looks like

Items          Date 
calzini      2020/02/23
cintura      2020/02/21
maglietta    2020/02/23
maglietta    2020/02/22
cappello     2020/02/23
jeans        2020/02/23
cappello     2020/02/22
maglietta    2020/02/22
maglietta    2020/02/22
jeans        2020/02/22
jeans        2020/02/23
maglietta    2020/02/23
jeans        2020/02/22
jeans        2020/02/23

I would like to have

Items         Category            
calzini        bad
cintura        bad
maglietta     best seller
maglietta     best seller
jeans         best seller
cappello       ok
jeans         best seller
cappello       ok
maglietta     best seller
maglietta     best seller
jeans         best seller
maglietta     best seller
jeans         best seller
jeans         best seller

I already determined the frequency of the items as follows:

sold_items=df.groupby(['Items'])['Date'].count().sort_values(ascending=False) # the items should be counted overall, not using a specific Date! It is about how many items were sold

I would like to ask you how to create a new column with those values.

Upvotes: 0

Answers (6)

StupidWolf

Reputation: 46898

You can use cut on value_counts:

pd.cut(df['Items'].value_counts(),bins=[0,1,4,10])

maglietta    (4, 10]
jeans        (4, 10]
cappello      (1, 4]
calzini       (0, 1]
cintura       (0, 1]
Name: Items, dtype: category
Categories (3, interval[int64]): [(0, 1] < (1, 4] < (4, 10]]

So this cut does not include the lowest, hence round brackets on left, and includes highlist, square brackets on right. Now we convert these labels to what you need:

cats = pd.cut(df['Items'].value_counts(),bins=[0,1,4,10],labels=['bad','ok','best seller'])

And just map the values according to the category and assign it to a new column using .tonumpy() (thanks to @Ch3steR for pointing it out, see comments):

df['Category'] = cats[df['Items']].to_numpy()

df

    Items       Date        Category
0   calzini     2020/02/23  bad
1   cintura     2020/02/21  bad
2   maglietta   2020/02/23  best seller
3   maglietta   2020/02/22  best seller
4   cappello    2020/02/23  ok
5   jeans       2020/02/23  best seller
6   cappello    2020/02/22  ok
7   maglietta   2020/02/22  best seller
8   maglietta   2020/02/22  best seller
9   jeans       2020/02/22  best seller
10  jeans       2020/02/23  best seller
11  maglietta   2020/02/23  best seller
12  jeans       2020/02/22  best seller
13  jeans       2020/02/23  best seller

You can also use df['Category'] = df['Items'].map(cats)

Upvotes: 2

Tom

Reputation: 8790

You can also replace the conditional values in value_counts and then map:

counts = df['Items'].value_counts()
counts = counts.replace(counts.values, ['best seller' if i >= 5 else ('ok' if i in [2,3,4] else 'bad') for i in counts])
df['category'] = df['Items'].map(counts)

Upvotes: 0

A.B

Reputation: 20445

According to what you have defined, you dont have dependency of category on date(as assumed by the outpu).

You can simply use apply function on axis 1

def testfun(e):
  count = len(df[df["Items"] == e["Items"]])
  if(count>=5):
    return "best seller"
  if(count>=2 and count<5 ):
    return "ok"
  else:
    return "bad"    


df["count"] = df.apply(testfun,axis=1)

1   cintura bad
2   maglietta   best seller
3   maglietta   best seller
4   cappello    ok
5   jeans   best seller
6   cappello    ok
7   maglietta   best seller
8   maglietta   best seller
9   jeans   best seller
10  jeans   best seller
11  maglietta   best seller
12  jeans   best seller
13  jeans   best seller

Upvotes: 0

rhug123

Reputation: 8768

The code below should work.

df['category'] = pd.cut(df['sold_items'],bins = [0,1,4,df['sold_items'].max()],labels = ['bad','ok','best seller'])

Upvotes: 3

Ch3steR

Reputation: 20669

You can use GroupBy.transform and use np.select

vals = df['Items'].groupby(df['Items']).transform('count')
condlist = [vals.ge(5), (vals.ge(2) & vals.lt(5)) , vals.le(2)]
choicelist = ['best seller', 'ok', 'bad']
df.assign(category =  np.select(condlist, choicelist))

        Items        Date     category
0     calzini  2020/02/23          bad
1     cintura  2020/02/21          bad
2   maglietta  2020/02/23  best seller
3   maglietta  2020/02/22  best seller
4    cappello  2020/02/23           ok
5       jeans  2020/02/23  best seller
6    cappello  2020/02/22           ok
7   maglietta  2020/02/22  best seller
8   maglietta  2020/02/22  best seller
9       jeans  2020/02/22  best seller
10      jeans  2020/02/23  best seller
11  maglietta  2020/02/23  best seller
12      jeans  2020/02/22  best seller
13      jeans  2020/02/23  best seller

Upvotes: 2

busybear

Reputation: 10590

Use groupby and transform. You'll also need to create a function to categorize your items:

def categorize(x):
    num = len(x)
    if num >= 5:
        return 'best seller'
    elif num >= 3:
        return 'ok'
    else:
        return 'bad'

df['category'] = df.groupby('Items').transform(categorize)

Upvotes: 0

Create a new column with frequency-based categories in pandas

Answers (6)

Related Questions