Picking random elements from groupby using pandas

Question

I have a dataframe that looks like this:

    revisionId  itemId wikidataType
1    307190482      23           Q5
6    305019084      80           Q5
8    303692414     181           Q5
9    306600439     192           Q5
11   294597048     206           Q5

The complete dataframe has 100 distinct values in the wikidataType column. It's a large dataframe, so I want to restrict it to 1000 records per wikidataType. To do that, I used the following:

df = df[df.groupby('wikidataType')['wikidataType'].cumcount() < 1000]

This gives me the first 1000 records for each wikidataType, but I want to choose the 1000 records randomly. So I tried:

df = df[random.sample(list(df.groupby('wikidataType')['wikidataType']), 1000)]

But it raised an error:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

I even tried

 df = df[df.groupby('wikidataType')['wikidataType'].cumcount().random() < 1000]

But that didn't work either. Does anyone know how I can do this?

Thanks in advance.

cs95 · Accepted Answer

A simpler method that I'd recommend, if you want the first 1000 elements, would be using groupby + head:

df = df.groupby('wikidataType').head(1000)

If you want 1000 random elements, call sample:

df = df.groupby('wikidataType', group_keys=False)\
                           .apply(lambda x: x.sample(1000))
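
As a hedged alternative, and assuming pandas 1.1 or later, the groupby object has its own sample method, so the lambda is not strictly needed. The sketch below uses a made-up toy frame (the itemId values and the Q6 type are illustrative only) and a small n so it runs anywhere. Note that both this call and x.sample(1000) raise a ValueError when a group has fewer rows than requested, unless replace=True is passed.

```python
import pandas as pd

# Toy data standing in for the real dataframe (values are hypothetical).
toy = pd.DataFrame({
    "itemId": range(8),
    "wikidataType": ["Q5"] * 5 + ["Q6"] * 3,
})

# Draw 2 random rows from every group; random_state makes the draw repeatable.
picked = toy.groupby("wikidataType").sample(n=2, random_state=42)

# Count how many rows were kept for each type.
per_group = picked["wikidataType"].value_counts().to_dict()
print(per_group)
```

Passing random_state is optional; leave it out to get a different draw on every run.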

You could choose to specify a fraction instead:

df = df.groupby('wikidataType', group_keys=False)\
                           .apply(lambda x: x.sample(frac=0.1))

This gives you 10% of the rows for each type. It helps if your population sizes vary, or if any group has fewer than 1000 elements.
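
To make the fraction-based sampling concrete, here is a minimal sketch on toy data (the group sizes are invented for illustration). frac is a proportion between 0 and 1 applied to each group separately, so larger groups contribute more rows. It assumes pandas 1.1 or later for the groupby sample method; the apply form with x.sample(frac=0.1) should behave the same way.

```python
import pandas as pd

# Two groups of different sizes: 20 rows of Q5 and 10 rows of Q6 (hypothetical).
toy = pd.DataFrame({
    "itemId": range(30),
    "wikidataType": ["Q5"] * 20 + ["Q6"] * 10,
})

# Keep 10% of each group, so the sample size follows the group size.
tenth = toy.groupby("wikidataType").sample(frac=0.1, random_state=0)

sizes = tenth["wikidataType"].value_counts().to_dict()
print(sizes)
```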

A slight modification to this method, based on your comment, would be:

df = df.groupby('wikidataType', group_keys=False)\
               .apply(lambda x: x.sample(1000) if len(x) > 1000 else x)
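
The capped sampling above can also be sketched without apply, whose handling of the grouping column differs between pandas versions. This is a swapped-in variant rather than the method shown above: it loops over the groups and concatenates the pieces. The helper name sample_per_group, its parameters, and the toy data are all hypothetical.

```python
import pandas as pd

def sample_per_group(df, key, n, seed=None):
    """Return at most n randomly chosen rows for each group of `key`."""
    parts = [
        # min() keeps small groups whole instead of raising an error.
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in df.groupby(key)
    ]
    return pd.concat(parts)

# Toy data: one group above the cap, one below it (values are made up).
toy = pd.DataFrame({
    "itemId": range(10),
    "wikidataType": ["Q5"] * 7 + ["Q6"] * 3,
})

capped = sample_per_group(toy, "wikidataType", n=5, seed=0)
counts = capped["wikidataType"].value_counts().to_dict()
print(counts)
```

With the real data, the call would look like sample_per_group(df, 'wikidataType', n=1000). Groups at or below the cap come back complete, though their rows are shuffled rather than left in the original order.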
