pylearner
pylearner

Reputation: 1460

function to return the highest count value using a rule

I have two columns like shown below, and trying to return the highest count of the second column, but its just returning me the highest count on rating without considering the gender

DATA :

print (df)

   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

CODE :

 s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])

OUTPUT :

print (s[F])
('PG')

print(s[M]

('PG', 'R')

Upvotes: 1

Views: 971

Answers (2)

jezrael
jezrael

Reputation: 862921

I think you need for counts with categories and ratings use groupby + value_counts + head:

df1 = (df.groupby('gender')['rating']
         .apply(lambda x: x.value_counts().head(1))
         .rename_axis(('gender','rating'))
         .reset_index(name='val'))
print (df1)
  gender rating  val
0      F   PG13    2
1      M      R    2

If want only top ratings seelct first value of index per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])
print (s)
gender
F    PG13
M       R
Name: rating, dtype: object

print (s['M'])
R
print (s['F'])
PG13

Or only top counts select first value of Series per group:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().iat[0])
print (s)
gender
F    2
M    2
Name: rating, dtype: int64

print (s['M'])
2
print (s['F'])
2

EDIT:

s = df.groupby('gender')['rating'].apply(lambda x: x.value_counts().index[0])

def gen_mpaa(gender):
    return s[gender]

print (gen_mpaa('M'))

print (gen_mpaa('F'))

EDIT:

Solution if genre id values are strings:

print (type(df.loc[0, 'genre id']))
<class 'str'>

df = df.set_index('gender')['genre id'].str.split(',', expand=True).stack()
print (df)
gender   
M       0    11
        1    22
        2    33
        0    22
        1    44
        2    55
        0    33
        1    44
        2    55
F       0    11
        1    22
        0    22
        1    55
        0    55
        1    44
dtype: object

d = df.groupby(level=0).apply(lambda x: x.value_counts().index[0]).to_dict()
print (d)
{'M': '55', 'F': '55'}

EDIT1:

print (df)
   AGE GENDER rating
0   10      M     PG
1   10      M      R
2   10      M      R
3    4      F   PG13
4    4      F   PG13

s = (df.groupby(['AGE', 'GENDER'])['rating']
       .apply(lambda x: x.value_counts().head(2))
       .rename_axis(('a','b', 'c'))
       .reset_index(level=2)['c'])
print (s)

a   b
4   F    PG13
10  M       R
    M      PG
Name: c, dtype: object

Upvotes: 1

pylang
pylang

Reputation: 44525

Here is a standard library solution for this file:

%%file "test.txt"
gender  rating
M   PG
M   R
F   NR
M   R
F   PG13
F   PG13

Given

import collections as ct


def read_file(fname):
    with open(fname, "r") as f:
        header = next(f)
        for line in f:
            gender, rating = line.strip().split()
            yield gender, rating

Code

filename = "test.txt"

dd = ct.defaultdict(ct.Counter)
for k, v in sorted(read_file(filename), key=lambda x: x[0]):
    dd[k][v] += 1 

{k: v.most_common(1) for k, v in dd.items()}
# {'F': [('PG13', 2)], 'M': [('R', 2)]}

Details

Each line of the file is parse and added to a defaultdict. The keys are genders, but the values are Counter objects for each rating per gender. Counter.most_common() is called to retrieve the top occurrences.

Since the data is grouped by gender, you can explore more information. For example, unique ratings of each gender:

{k: set(v.elements()) for k, v in dd.items()}
# {'F': {'NR', 'PG13'}, 'M': {'PG', 'R'}}

Upvotes: 2

Related Questions