user19413311

pandas: drawing several random samples with no rows shared between them

I have a dataframe with 5000 rows.

I want to draw 4 random samples of 300 rows each from it. No row should be duplicated within a sample, and no row should be duplicated across samples either. For example, I don't want a row to appear in both sample 1 and sample 3.

I have tried df.sample(300, replace=False), but that only prevents duplicates within a single sample.

I have also searched the forum but didn't find what I need.

How can I do this in pandas without doing batch groups?

Upvotes: 1

Views: 1098

Answers (1)

Stryder

Reputation: 880

I don't think there is a pandas function specifically for that, but how about doing this:

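import pandas as pd

df = pd.DataFrame({"col": range(5000)})

# Draw all 4 * 300 = 1200 rows in one call, without replacement
sample = df.sample(1200, replace=False)

sample.duplicated().any()  # False, i.e. no duplicates

# Cut the draw into 4 consecutive, non-overlapping samples of 300 rows
samples = [sample.iloc[i - 300:i] for i in range(300, 1500, 300)]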

Because .sample draws without replacement, the 1200 rows are all distinct, so the four consecutive 300-row slices cannot share a row, and each slice is itself a random sample of the dataframe.
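To double-check the result, a minimal sketch along the same lines is shown here. It assumes the dataframe's index is unique, so that index labels identify rows. The names n_samples and sample_size and the random_state value are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"col": range(5000)})

n_samples, sample_size = 4, 300  # illustrative names, not from the answer

# One draw without replacement; random_state only makes the sketch reproducible
shuffled = df.sample(n_samples * sample_size, replace=False, random_state=0)

# Consecutive, non-overlapping slices of the single draw
samples = [
    shuffled.iloc[k * sample_size:(k + 1) * sample_size]
    for k in range(n_samples)
]

# Row identity check: no index label occurs twice across the samples
combined_index = pd.concat(samples).index
print(combined_index.is_unique)
```

Checking the index rather than calling .duplicated() on the values also covers dataframes where two different rows happen to hold identical values.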

Upvotes: 1
