Reputation: 177

How can I select one of each category with a Pandas DataFrame in Python?

I have a dataframe that looks like this:

author|string
abc|hi
abc|yo
def|whats
ghi|up
ghi|dog

How can I select only one row per author? I'm at a loss. I want to do something like this:

df.loc[unique authors].sample(n=1000)

and get something like this:

author|string
abc|hi
def|whats
ghi|up

I was thinking of converting the author column to categories, but I don't know where to go from there.

I could just do something like this but it seems stupid.

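# Collect the index label of the first row for each author
author_list = df['author'].unique().tolist()
indexes = []
for author in author_list:
    indexes.append(df.loc[df['author'] == author].index[0])
# Select those rows by label, then sample (n must not exceed the number of authors)
df.loc[indexes].sample(n=1000)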

Upvotes: 0

Views: 1440

Answers (2)

Rodalm

Reputation: 5503

Use groupby + sample

# sample one random row from each 'author' group
res = df.groupby('author').sample(1)

Output

>>> for _ in range(3):
...     print(df.groupby('author').sample(1), '\n')


  author string
0    abc     hi
2    def  whats
3    ghi     up 

  author string
0    abc     hi
2    def  whats
3    ghi     up 

  author string
1    abc     yo
2    def  whats
4    ghi    dog 

Setup:

import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog']
})

Note that groupby(...).sample draws whole rows from each group, so the values within a row stay together even if your DataFrame has more columns. GroupBy.sample requires pandas 1.1 or later; on older versions you can get the same result with

res = df.groupby('author', group_keys=False).apply(pd.DataFrame.sample, n=1)
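
If you want to check how the sampling behaves with more than one column, or need the draw to be repeatable, here is a minimal sketch. The 'length' column is a made-up extra column added only for illustration, and random_state=0 is an arbitrary seed.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog'],
    'length': [2, 2, 5, 2, 3],  # hypothetical extra column, not in the question
})

# random_state fixes the seed so repeated runs pick the same rows
res = df.groupby('author').sample(n=1, random_state=0)

# Compare each sampled row with the original row that has the same index label
rows_intact = res.equals(df.loc[res.index])

print(res)
print(rows_intact)
```

If rows_intact is True, every sampled row is an unmodified row of the original DataFrame.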

Upvotes: 1

BENY

Reputation: 323316

You can use drop_duplicates, which keeps the first row for each author:

out = df.drop_duplicates('author')
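
A rough sketch of a few variations, in case the first row is not what you want. It assumes the small example frame from the question. The shuffle-then-deduplicate step is one possible way to get a random row per author, and the cap of 1000 mirrors the sample(n=1000) in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['abc', 'abc', 'def', 'ghi', 'ghi'],
    'string': ['hi', 'yo', 'whats', 'up', 'dog'],
})

# keep='first' is the default: first row seen for each author
first = df.drop_duplicates('author')

# keep='last' keeps the last row seen for each author instead
last = df.drop_duplicates('author', keep='last')

# Shuffle all rows, then deduplicate, to get a random row per author
random_pick = (
    df.sample(frac=1, random_state=0)
      .drop_duplicates('author')
      .sort_index()
)

# Take at most 1000 of the one-per-author rows without exceeding what exists
subset = first.sample(n=min(1000, len(first)), random_state=0)

print(first)
print(last)
print(random_pick)
```

Capping n with min(...) avoids the ValueError that sample raises when n is larger than the number of rows and replace is False.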

Upvotes: 1
