Nilakshi Naphade

Reputation: 1075

Picking random elements from groupby using pandas

I have a dataframe that looks like this:

    revisionId  itemId wikidataType
1    307190482      23           Q5
6    305019084      80           Q5
8    303692414     181           Q5
9    306600439     192           Q5
11   294597048     206           Q5

In the complete dataframe, the column wikidataType contains 100 different values. It's a large dataframe, so I want to restrict it to 1000 records per wikidataType. To do that, I used the following:

df = df[df.groupby('wikidataType')['wikidataType'].cumcount() < 1000]

This gives me the first 1000 records for each wikidataType. I want to choose these 1000 records randomly instead. So I tried using

df = df[random.sample(list(df.groupby('wikidataType')['wikidataType']), 1000)]

But it gave this error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

I even tried

df = df[df.groupby('wikidataType')['wikidataType'].cumcount().random() < 1000]

But that didn't work either. Does anyone know how I can do this?

Thanks in advance.

Upvotes: 1

Views: 1910

Answers (2)

Philipp Schwarz

Reputation: 20694

In newer versions of pandas (1.1.0 and later), you can simply do:

df = df.groupby('wikidataType').sample(1000)

Highly recommended because it is much simpler.
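
As a minimal sketch of how this behaves, assuming pandas 1.1 or later: the toy dataframe below and the small n are made up for illustration. Passing random_state makes the draw repeatable, and asking for more rows than a group contains raises a ValueError unless replace=True is passed.

```python
import pandas as pd

# Hypothetical toy data: 5 rows of type Q5 and 3 rows of type Q1.
df = pd.DataFrame({
    "itemId": range(8),
    "wikidataType": ["Q5"] * 5 + ["Q1"] * 3,
})

# Draw 2 random rows per group; random_state makes the draw repeatable.
sampled = df.groupby("wikidataType").sample(n=2, random_state=42)
print(sampled["wikidataType"].value_counts().to_dict())

# Asking for more rows than a group holds fails unless replace=True.
try:
    df.groupby("wikidataType").sample(n=4)
    too_large_failed = False
except ValueError:
    too_large_failed = True
print(too_large_failed)
```

So with the real data, sample(1000) only works as written if every wikidataType has at least 1000 rows.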

Upvotes: 1

cs95

Reputation: 402353

A simpler method that I'd recommend, if you want the first 1000 elements, would be using groupby + head:

df = df.groupby('wikidataType').head(1000)

If you want 1000 random elements, call sample:

df = df.groupby('wikidataType', group_keys=False)\
                           .apply(lambda x: x.sample(1000))

You could choose to specify a fraction instead:

df = df.groupby('wikidataType', group_keys=False)\
                           .apply(lambda x: x.sample(frac=0.1))

This gives you 10% of the rows of each wikidataType. It helps if your group sizes vary, or if any group has fewer than 1000 rows.
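
To illustrate fractional sampling on groups of different sizes, here is a small sketch. It uses GroupBy.sample(frac=...) rather than apply, which assumes pandas 1.1 or later, and the dataframe is invented for the example.

```python
import pandas as pd

# Hypothetical toy data: 20 rows of type Q5 and 10 rows of type Q1.
df = pd.DataFrame({
    "itemId": range(30),
    "wikidataType": ["Q5"] * 20 + ["Q1"] * 10,
})

# Take 10% of each group, so larger groups contribute more rows.
tenth = df.groupby("wikidataType").sample(frac=0.1, random_state=0)

# Number of sampled rows per group.
sizes = tenth.groupby("wikidataType").size().to_dict()
print(sizes)
```

The sample size scales with each group, so no group can be asked for more rows than it has.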


A slight modification to this method, based on your comment, would be:

df = df.groupby('wikidataType', group_keys=False)\
               .apply(lambda x: x.sample(1000) if len(x) > 1000 else x) 
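
Another way to get the same "at most n random rows per group" result, sketched here as an alternative rather than as part of the method above, is to shuffle the whole dataframe once and then take head(n) per group. The toy data and n = 3 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: groups of 6, 3 and 1 rows.
df = pd.DataFrame({
    "itemId": range(10),
    "wikidataType": ["Q5"] * 6 + ["Q1"] * 3 + ["Q2"],
})
n = 3  # cap per group (1000 in the question)

# Shuffle all rows, then keep the first n rows seen for each group.
shuffled = df.sample(frac=1, random_state=0)
capped = shuffled.groupby("wikidataType").head(n)

# Groups smaller than n are kept whole; larger ones are cut to n rows.
counts = capped["wikidataType"].value_counts().to_dict()
print(counts)
```

Because head never asks for more rows than a group has, small groups need no special case.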

Upvotes: 4
