Reputation: 1075
I have dataframe which looks like this:
revisionId itemId wikidataType
1 307190482 23 Q5
6 305019084 80 Q5
8 303692414 181 Q5
9 306600439 192 Q5
11 294597048 206 Q5
In the complete dataframe, there are 100 such distinct values in the wikidataType column. It's a large dataframe, so I want to restrict it to 1000 records per wikidataType. Hence, I used the following:
df = df[df.groupby('wikidataType')['wikidataType'].cumcount() < 1000]
This gives me the first 1000 records for each wikidataType, but I want to choose these 1000 records randomly. So I tried using
df = df[random.sample(list(df.groupby('wikidataType')['wikidataType']), 1000)]
But it gave an error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I even tried
df = df[df.groupby('wikidataType')['wikidataType'].cumcount().random() < 1000]
But that also didn't work. Does anyone know how I can do this?
Thanks in advance.
Upvotes: 1
Views: 1910
Reputation: 20694
In newer versions of pandas (1.1.0 and later) you can simply do:
df = df.groupby('wikidataType').sample(1000)
Highly recommended because it's much simpler. Note that this raises a ValueError if any group has fewer than 1000 rows and replace=False (the default); pass replace=True or use an apply-based approach if that's a concern.
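A minimal runnable sketch of this on toy data (column names mirror the question; the two types and the cap of 3 per group are illustrative stand-ins):

```python
import pandas as pd

# Toy dataframe: 10 rows each for two wikidataType values
df = pd.DataFrame({
    "revisionId": range(20),
    "itemId": range(20),
    "wikidataType": ["Q5"] * 10 + ["Q11424"] * 10,
})

# Draw 3 random rows per wikidataType (requires pandas >= 1.1)
sampled = df.groupby("wikidataType").sample(n=3, random_state=0)
print(sampled["wikidataType"].value_counts())
```

Passing `random_state` makes the draw reproducible; omit it for a fresh random sample each run.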
Upvotes: 1
Reputation: 402353
A simpler method that I'd recommend, if you want the first 1000 elements, would be using groupby + head:
df = df.groupby('wikidataType').head(1000)
If you want 1000 random elements, call sample:
df = df.groupby('wikidataType', group_keys=False)\
.apply(lambda x: x.sample(1000))
You could choose to specify a fraction instead:
df = df.groupby('wikidataType', group_keys=False)\
       .apply(lambda x: x.sample(frac=0.1))
Which gives you 10% of each group. This helps if your group sizes vary, or if some groups have fewer than 1000 elements.
A slight modification to this method, based on your comment, would be:
df = df.groupby('wikidataType', group_keys=False)\
.apply(lambda x: x.sample(1000) if len(x) > 1000 else x)
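The conditional version above can be sketched on toy data like so (the cap of 3 stands in for 1000; groups smaller than the cap are kept whole):

```python
import pandas as pd

# Toy data: one group with 5 rows, one with only 2, and a cap of 3 per group
df = pd.DataFrame({
    "wikidataType": ["Q5"] * 5 + ["Q515"] * 2,
    "itemId": range(7),
})

cap = 3
out = (df.groupby("wikidataType", group_keys=False)
         .apply(lambda x: x.sample(cap, random_state=0) if len(x) > cap else x))
print(out["wikidataType"].value_counts())
```

With `group_keys=False` the result keeps the original row index rather than adding the group key as an extra index level.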
Upvotes: 4