Reputation: 15814
I'm trying out pandas for the first time. I have a dataframe with two columns: user_id and string. Each user_id may have several strings, thus showing up in the dataframe multiple times. I want to derive another dataframe from this, one that lists only those user_ids that have at least 2 strings associated with them.
I tried df[df['user_id'].value_counts() > 1], which I thought was the standard way to do this, but it yields IndexingError: Unalignable boolean Series key provided. Can someone clear up my misunderstanding and provide the correct alternative?
Upvotes: 4
Views: 11510
Reputation: 171
I had the same challenge and used the following, which returns the user_ids that occur more than once together with their counts:
df['user_id'].value_counts()[df['user_id'].value_counts() > 1]
Credits: blog.softhints
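The expression above yields a Series of counts rather than rows of the original dataframe. The mask in the question fails for a related reason: value_counts() returns a Series indexed by the unique user_id values, so it cannot be aligned with the dataframe's row index. A minimal sketch of one way to get from those counts back to the rows, assuming a small made-up dataframe (the sample data and the names counts, repeated_ids and result are illustrative, not from the original post):

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

# value_counts() is indexed by user_id, not by the dataframe's row labels.
counts = df["user_id"].value_counts()

# Keep the user_ids that appear more than once.
repeated_ids = counts[counts > 1].index

# isin() builds a row-aligned boolean mask that can index the dataframe.
result = df[df["user_id"].isin(repeated_ids)]
print(result)
```

Computing value_counts() once and reusing it also avoids counting the column twice.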
Upvotes: 0
Reputation: 1064
You can simply do the following:
col = 'column_name'  # name of the column to count occurrences in ('user_id' in the question)
n = 10  # minimum number of occurrences required (2 in the question)
df = df[df.groupby(col)[col].transform('count').ge(n)]
This should filter the dataframe as you need.
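As a rough illustration of why this works, transform('count') returns one value per row (the size of that row's group), so the resulting mask shares the dataframe's index. The sketch below applies it to the question's case with n = 2; the sample data and the names group_sizes and result are assumptions made for the example:

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

col = "user_id"
n = 2  # keep user_ids with at least 2 rows

# One count per row, aligned with df's index.
group_sizes = df.groupby(col)[col].transform("count")

# Rows whose user_id occurs at least n times.
result = df[group_sizes.ge(n)]
print(result)
```

Here user_id 2 occurs only once, so its row is dropped while all rows for user_ids 1 and 3 are kept.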
Upvotes: 0
Reputation: 39
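# Counts of each val1 among rows where val is 'Best'
best_counts = df.val1.loc[df.val == 'Best'].value_counts().sort_index()
# Counts of each val1 among rows where val is in l11 (l11 is defined elsewhere)
all_counts = df.val1.loc[df.val.isin(l11)].value_counts().sort_index()
# val1 values whose ratio of the two counts exceeds 0.5
l2 = (best_counts / all_counts).loc[lambda x: x > 0.5].index.tolist()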
Upvotes: -1