Danqi Liu
Danqi Liu

Reputation: 53

Python value counts and return the other columns in pandas

I just started doing my first ML project with Python, and got stuck with one issue. enter image description here My project is to analyze some animal shelter information. Above is the list of animals have been bought to the shelter. I want to see which animals(based on animal ID) have been bought to the shelter more than 3 times, and then return the dataframe with animal ID, breed, MonthYear information etc. My first attempt:

aac[aac['Animal ID'].value_counts()>3]

After this code, Python tells me: IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

My second attempt:

temp=df.groupby('Animal ID').count() 
temp[temp['Name']>3]

Now I get the following output : enter image description here

Now I do have the list of animal ID who has been returned to shelter more than 3 times. But I miss the other breed/datatime/monthyear ...information. My question is what is the best way to link animal ID and the other information: breed, monthyear, animaltype etc together. So I can further analyze the returned animals.

Thanks a lot for helping me out!

Upvotes: 2

Views: 2155

Answers (2)

ndr
ndr

Reputation: 1437

You could create another column with counts, and use it for filtering.

df = pandas.DataFrame({'col1': range(10), 'col2': range(10, 20), 'col3': ['A', 'A', 'B', 'A', 'B', 'A', 'B', 'C', 'C', 'B']})

df = df.merge(df.groupby('col3').apply(len).to_frame(name='counts'), left_on='col3', right_index=True)

df[df.counts > 2]

Upvotes: 0

jpp
jpp

Reputation: 164693

A groupby operation is not required here.

You can create a value_counts series, filter it for items which occur more than 3 times, and then use this to filter your original dataframe.

This is a minimal example:

df = pd.DataFrame({'col1': range(10), 'col2': range(10, 20)},
                  index=['A', 'A', 'B', 'A', 'B', 'A', 'B', 'C', 'C', 'B'])

c = df.index.value_counts()
idx = c[c > 3].index

res = df.loc[df.index.isin(idx)]

print(res)

   col1  col2
A     0    10
A     1    11
B     2    12
A     3    13
B     4    14
A     5    15
B     6    16
B     9    19

Upvotes: 3

Related Questions