Reputation: 748
I have a pandas DataFrame with 100,000 rows and want to split it into 100 sections of 1,000 rows each.
How do I draw a random sample of a certain size (e.g. 50 rows) from just one of the 100 sections? The df is already ordered such that the first 1000 rows are from the first section, the next 1000 rows from the second, and so on.
Upvotes: 49
Views: 116791
Reputation: 738
You could add a "section" column to your data, then perform a groupby and sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
# x section
# 0 0 0
# 1 1 0
# 2 2 0
# 3 3 0
# 4 4 0
# ... ... ...
# 99995 99995 99
# 99996 99996 99
# 99997 99997 99
# 99998 99998 99
# 99999 99999 99
#
# [100000 rows x 2 columns]
sample = df.groupby("section").sample(50)
# >>> sample
# x section
# 907 907 0
# 494 494 0
# 775 775 0
# 20 20 0
# 230 230 0
# ... ... ...
# 99740 99740 99
# 99272 99272 99
# 99863 99863 99
# 99198 99198 99
# 99555 99555 99
#
# [5000 rows x 2 columns]
Add .query("section == 42") (or a similar filter) to the result if you are interested in only one particular section.
Note that this requires pandas 1.1.0 or later; see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
For older versions, see the answer by @msh5678
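If only one section is needed, filtering before sampling avoids drawing 50 rows from each of the other 99 groups. The following is a minimal sketch, assuming the same "section" column as above; the section number 42 and the random_state value are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Same layout as the example above: 100 sections of 1,000 rows each.
df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)

# Keep only the rows of section 42, then draw 50 of them without replacement.
# random_state is optional; it only makes the draw reproducible.
one_section = df[df["section"] == 42].sample(n=50, random_state=0)
```

DataFrame.sample is used here rather than the groupby version, so this should also work on pandas releases older than 1.1.0.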
Upvotes: 8
Reputation: 31
Thank you, Jeff, but I received an error:
AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method
So instead of sample = df.groupby("section").sample(50), I suggest using the command below:
df.groupby('section').apply(lambda grp: grp.sample(50))
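As a runnable sketch of this fallback, the example below rebuilds the DataFrame from the earlier answer and applies the lambda per group. Passing group_keys=False is my own addition (an assumption about what is usually wanted): it keeps the original row index instead of adding the section as an extra index level.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)

# Sample 50 rows inside each group via apply, which works where
# DataFrameGroupBy.sample is not available.
sample = df.groupby("section", group_keys=False).apply(
    lambda grp: grp.sample(50)
)
```

The result has 5,000 rows in total, 50 from each of the 100 sections.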
Upvotes: 3
Reputation: 964
This is a nice place for recursion.
import numpy as np

def main2():
    rows = 8  # say you have 8 rows; with real data use len(df) instead
    rands = []
    for i in range(rows):
        gen = fun(rands, rows)
        rands.append(gen)
    print(rands)  # row positions in random order; loop over these

def fun(rands, rows):
    # Draw a position in [0, rows) and retry recursively until it is unused.
    gen = np.random.randint(0, rows)
    if gen in rands:
        return fun(rands, rows)
    return gen

if __name__ == "__main__":
    main2()

Example output: [6, 0, 7, 1, 3, 5, 4, 2]
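The recursive routine above builds a random ordering of the positions 0 to 7 with no repeats. For comparison, here is a sketch of the same idea using NumPy's built-in permutation; the seed and the names order and first_three are illustrative only.

```python
import numpy as np

rows = 8  # number of rows to shuffle; use len(df) for real data

# A seeded generator makes the ordering reproducible; drop the seed for
# a different ordering on each run.
rng = np.random.default_rng(0)

# permutation(rows) returns the integers 0..rows-1 in random order,
# each exactly once, without any retry loop.
order = rng.permutation(rows)

# Taking the first k entries gives k distinct random row positions.
first_three = order[:3]
```

These positions can then be passed to df.iloc to pick the corresponding rows.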
Upvotes: 0
Reputation: 375535
You can use the sample method*:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])
In [12]: df.sample(2)
Out[12]:
A B
0 1 2
2 5 6
In [13]: df.sample(2)
Out[13]:
A B
3 7 8
0 1 2
*On one of the section DataFrames.
Note: If the sample size is larger than the size of the DataFrame, this will raise an error unless you sample with replacement.
In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'
In [15]: df.sample(5, replace=True)
Out[15]:
A B
0 1 2
1 3 4
2 5 6
3 7 8
1 3 4
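To connect this to the question, one of the section DataFrames can be obtained by slicing the ordered frame by position and then calling sample on the slice. This is a sketch under the question's stated layout (consecutive blocks of 1,000 rows); the column name x, the section number i = 3, and random_state are placeholders.

```python
import numpy as np
import pandas as pd

# Stand-in for the 100,000-row DataFrame from the question.
df = pd.DataFrame({"x": np.arange(100_000)})

section_size = 1_000
i = 3  # zero-based section number (hypothetical choice)

# Rows i*1000 .. (i+1)*1000 - 1 form section i because the frame is ordered.
section = df.iloc[i * section_size:(i + 1) * section_size]

# Draw 50 distinct rows from that section only.
sample = section.sample(n=50, random_state=0)
```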
Upvotes: 73
Reputation: 471
One solution is to use the choice function from numpy.
Say you want 50 entries out of the first 1000 rows; you can use:
import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]
This of course does not take your block structure into account. If you want a 50-item sample from block i, for example, you can do:
import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
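The block-offset approach can be wrapped in a small helper so that i, the block size, and the sample size are explicit. This is only a sketch: sample_block and its parameters are hypothetical names, and it uses NumPy's Generator API (np.random.default_rng) in place of np.random.choice.

```python
import numpy as np
import pandas as pd

def sample_block(df, i, block_size=1_000, n=50, seed=None):
    """Return n distinct random rows from the i-th block of block_size rows."""
    rng = np.random.default_rng(seed)
    # Offsets within the block, drawn without replacement.
    offsets = rng.choice(block_size, size=n, replace=False)
    # Shift the offsets to the block's starting position in the frame.
    return df.iloc[i * block_size + offsets]

# Stand-in data with the same shape as in the question.
df = pd.DataFrame({"x": np.arange(100_000)})
df_trimmed_from_block_7 = sample_block(df, i=7, seed=0)
```

The helper assumes i is zero-based and that the DataFrame really is ordered in consecutive blocks, as described in the question.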
Upvotes: 12