Reputation: 15814

Filtering dataframe based on column value_counts (pandas)

I'm trying out pandas for the first time. I have a dataframe with two columns: user_id and string. Each user_id may have several strings, so it can appear in the dataframe multiple times. I want to derive another dataframe from this one that lists only those user_ids that have at least 2 strings associated with them.

I tried df[df['user_id'].value_counts() > 1], which I thought was the standard way to do this, but it raises IndexingError: Unalignable boolean Series key provided. Can someone clear up my misunderstanding and provide the correct alternative?

Upvotes: 4

Answers (4)

afrologicinsect

Reputation: 171

I had the same challenge and used:

df['user_id'].value_counts()[df['user_id'].value_counts() > 1]

Credits: blog.softhints
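
Note that the expression above returns the counts per user_id, not the rows of df. A minimal sketch of how those counts could be turned back into a row filter, assuming toy data (the sample values and the names counts, frequent_ids and result are made up for illustration):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

# Series indexed by the unique user_id values, holding their counts
counts = df["user_id"].value_counts()

# Keep only the ids that occur more than once
frequent_ids = counts[counts > 1].index

# isin builds a mask with the same index as df, so it can be used to select rows
result = df[df["user_id"].isin(frequent_ids)]
print(result)
```

Here isin is a swapped-in step that is not part of the answer above; it is one way to get from the filtered counts back to the original rows.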

Upvotes: 0

Amila Viraj

Reputation: 1064

You can simply do the following:

col = 'column_name'   # name of the column to count values in
n = 10                # minimum number of occurrences required

df = df[df.groupby(col)[col].transform('count').ge(n)]

This should filter the dataframe as you need.
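
For the question's case this would mean col = 'user_id' and n = 2. A small sketch of the same idea wrapped in a function, assuming toy data (filter_by_min_count is a hypothetical helper name, not a pandas function):

```python
import pandas as pd

def filter_by_min_count(df, col, n):
    """Keep rows whose value in `col` occurs at least `n` times (hypothetical helper)."""
    # transform('count') returns one count per row, aligned with df's index
    counts_per_row = df.groupby(col)[col].transform("count")
    return df[counts_per_row.ge(n)]

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

result = filter_by_min_count(df, "user_id", 2)
print(result)
```

Because the function returns a new dataframe instead of reassigning df, the original data stays available for other thresholds.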

Upvotes: 0

Aaka sh

Reputation: 39

l2 = ((df.val1.loc[df.val== 'Best'].value_counts().sort_index()/df.val1.loc[df.val.isin(l11)].value_counts().sort_index())).loc[lambda x : x>0.5].index.tolist()

Upvotes: -1

jezrael

Reputation: 862581

I think you need transform, because the boolean mask must have the same index as df. value_counts returns a Series indexed by the unique user_id values instead, so the mask cannot be aligned with df and the error is raised.

df[df.groupby('user_id')['user_id'].transform('size') > 1]
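
To make the index difference concrete, here is a minimal sketch on toy data (the sample values and the names counts, mask, filtered and alt are assumptions for illustration). It also shows mapping the counts back onto the column, which should give the same rows:

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

# value_counts: one entry per unique user_id, indexed by the id itself
counts = df["user_id"].value_counts()

# transform('size'): one entry per row of df, indexed like df
mask = df.groupby("user_id")["user_id"].transform("size") > 1
filtered = df[mask]

# Alternative: look up each row's user_id in counts, which again yields a row-aligned mask
alt = df[df["user_id"].map(counts) > 1]

print(filtered)
```

The key point is that mask has six entries labelled 0 to 5, like df, while counts has only three entries labelled by user_id.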

Upvotes: 9
