Reputation:
I have this 5000 rows dataframe.
I want to make 4 random samples of 300 rows from the dataframe. I want each of my sample to have no duplicate inside the sample, but i also want no duplicate among samples. Ie i dont want a row to appear in sample 1 and sample 3 for example.
I have tried df.sample(300,replace=False)
but it's not enough.
I have also searched the forum but didnt find what i want.
How can i code pandas to do so without doing batch groups?
Upvotes: 1
Views: 1098
Reputation: 880
I don't think there is a pandas function specifically for that, but how about doing this:
df = pd.DataFrame({"col": range(5000)})
sample = df.sample(1200, replace= False)
sample.duplicated().any()
>> False # <-- no duplicates
samples = [sample.iloc[i-300:i] for i in range(300, 1500, 300)] # <-- 4 samples
Considering that .sample will return a random selection without replacement, this would achieve what you want.
Upvotes: 1