UserYmY

Reputation: 8554

Python pandas: How to group by and count unique values based on multiple columns?

I have a DataFrame df:

id name number
1 sam   76
2 sam    8
2 peter  8 
4 jack   2

I would like to group by the 'id' column and count the number of unique (name, number) pairs in each group:

id count(name-number)
1    1
2    2
4    1     

I have tried this, but it does not work:

df.groupby('id')[('number','name')].nunique().reset_index()

Upvotes: 5

Views: 29314

Answers (4)

stedes

Reputation: 1571

You can just combine two groupbys to get the desired result.

import pandas
df = pandas.DataFrame({"id": [1, 2, 2, 4], "name": ["sam", "sam", "peter", "jack"], "number": [76, 8, 8, 2]})
group = df.groupby(['id','name','number']).size().groupby(level=0).size()

The first groupby counts the rows for each (id, name, number) combination, which leaves exactly one entry per unique combination. The second groupby then counts those unique combinations per id, using the fact that the first groupby put id into level 0 of the index.
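To make the two steps easier to follow, here is a minimal sketch that keeps the intermediate result. It assumes the data from the question; the variable names combos and per_id are only illustrative.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [76, 8, 8, 2],
})

# Step 1: one entry per distinct (id, name, number) combination.
# Repeated rows would collapse into a single entry here.
combos = df.groupby(["id", "name", "number"]).size()

# Step 2: count how many distinct combinations each id has.
# Level 0 of the index is "id".
per_id = combos.groupby(level=0).size()
print(per_id)
```

For this data, id 2 has two distinct pairs and ids 1 and 4 have one each.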

The result will be a Series. If you want a DataFrame with the right column name (as shown in your desired result), you can name the count column when resetting the index; passing a renaming dict to agg on a Series is rejected by recent pandas versions:

group = df.groupby(['id','name','number']).size().groupby(level=0).size().reset_index(name='count(name-number)')

Upvotes: 9

sparrow

Reputation: 11460

To get the unique values of one column for each value of another:

grouped = df.groupby('name').number.unique()
for k,v in grouped.items():
    print(k)
    print(v)

output:

jack
[2]
peter
[8]
sam
[76  8]

To count how often each value of one column occurs for each value of another:

df.groupby('name').number.value_counts().unstack().fillna(0)

output:

number   2    8    76
name
jack    1.0  0.0  0.0
peter   0.0  1.0  0.0
sam     0.0  1.0  1.0
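The counts above come out as floats because unstack fills the gaps with NaN before fillna runs. A small sketch, assuming the question's data, that passes fill_value to unstack so the table stays integer-valued:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [76, 8, 8, 2],
})

# Count each number per name, then pivot the numbers into columns.
# fill_value=0 fills missing combinations directly, so no NaN appears.
counts = df.groupby("name")["number"].value_counts().unstack(fill_value=0)
print(counts)
```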

Upvotes: 1

Shen Huang

Reputation: 1

Try:

df.groupby('id').apply(lambda x: x.drop('id', axis=1).drop_duplicates().shape[0]).reset_index()
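A possible variation on the same idea, sketched under the assumption that df holds the question's data: drop the duplicate rows once up front and then count rows per id, which avoids calling apply on every group. The column name passed to reset_index is taken from the question's desired output.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [76, 8, 8, 2],
})

result = (
    df.drop_duplicates(["id", "name", "number"])  # keep one row per distinct combination
      .groupby("id")
      .size()                                     # remaining rows per id = unique pairs
      .reset_index(name="count(name-number)")
)
print(result)
```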

Upvotes: 0

mvd

Reputation: 1199

You can do:

import pandas
df = pandas.DataFrame({"id": [1, 2, 3, 4], "name": ["sam", "sam", "peter", "jack"], "number": [8, 8, 8, 2]})
g = df.groupby(["name", "number"])
print(g.groups)

which gives:

{('jack', 2): [3], ('peter', 8): [2], ('sam', 8): [0, 1]}

To get the number of rows for each (name, number) pair you can do:

for p in g.groups:
    print(p, " has ", len(g.groups[p]), " entries")

which gives:

('peter', 8)  has  1  entries
('jack', 2)  has  1  entries
('sam', 8)  has  2  entries

Update:

The OP asked for the result as a DataFrame. One way to get this is to aggregate with len, which returns a DataFrame holding the number of rows for each pair. The count lands in the remaining 'id' column, hence the rename:

d = g.aggregate(len)
print(d.reset_index().rename(columns={"id": "num_entries"}))

gives:

    name  number  num_entries
0   jack       2           1
1  peter       8           1
2    sam       8           2
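As a hedged alternative sketch, size() produces the same table without relying on a leftover column to carry the counts. It uses the example data from this answer, and num_entries is just the column name chosen above.

```python
import pandas as pd

# Example data from this answer
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [8, 8, 8, 2],
})

# size() counts rows per (name, number) pair; reset_index names the count column
d = df.groupby(["name", "number"]).size().reset_index(name="num_entries")
print(d)
```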

Upvotes: 5
