Reputation: 8000
I am trying to sample data from a big dataset. The dataset looks like this:
id label
1 A
2 B
3 C
4 A
.........
Code to generate a sample dataset:
import random
import pandas as pd

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N']
weights = [0.350019, 0.209966, 0.126553, 0.100983, 0.053767, 0.039378, 0.029529,
           0.019056, 0.016783, 0.014813, 0.014152, 0.013477, 0.009444, 0.002082]
N = 300000

df = pd.DataFrame()
df['id'] = list(range(1, N + 1))
df['label'] = random.choices(labels, weights=weights, k=N)

# keep a single label per id
group_dict = df.groupby(['id']).apply(lambda x: list(set(x['label'].tolist()))[0]).to_dict()
df = pd.DataFrame(group_dict.items())
df.columns = ['id', 'label']
The distribution of labels in the dataset is
df['label'].value_counts(normalize=True)
A 0.350373
B 0.209707
C 0.126307
D 0.101353
E 0.053917
F 0.039487
G 0.029217
H 0.018780
I 0.016510
J 0.015083
K 0.014323
L 0.013467
M 0.009530
N 0.001947
I created a new column in the dataset:
df['freq'] = df.groupby('label')['label'].transform('count')
When I try to sample, say, 5000 items:
sampledf = df.sample(n=5000, weights=df.freq, random_state=42)
the distribution of the labels in sampledf is not the same as that in df:
A 0.6048
B 0.2198
C 0.0850
D 0.0544
E 0.0190
F 0.0082
G 0.0038
H 0.0020
I 0.0010
K 0.0008
L 0.0008
J 0.0004
I am not sure why the distribution is not the same as in the original data frame.
Can anybody help me with what I am missing here?
Thanks
Upvotes: 3
Views: 1072
Reputation: 15452
If you're assigning the frequency back onto the original dataframe and then passing it as the sampling weight, that's probably the issue: the labels are already repeated in proportion to their frequency, so the explicit weights count that frequency a second time. Make sure you don't have duplicate labels and weights going into your sampling.
Using your summary data I can generate 5000 samples which do have (roughly) the same distribution as the original:
In [1]: import pandas as pd
In [2]: summary = pd.DataFrame(
...: [
...: ['A', 0.350019],
...: ['B', 0.209966],
...: ['C', 0.126553],
...: ['D', 0.100983],
...: ['E', 0.053767],
...: ['F', 0.039378],
...: ['G', 0.029529],
...: ['H', 0.019056],
...: ['I', 0.016783],
...: ['J', 0.014813],
...: ['K', 0.014152],
...: ['L', 0.013477],
...: ['M', 0.009444],
...: ['N', 0.002082],
...: ],
...: columns=['label', 'freq']
...: )
You can sample from the summary table, weighting each unique label with the frequency in the original dataset:
In [3]: summary.label.sample(
...: n=5000,
...: weights=summary.freq,
...: replace=True,
...: ).value_counts(normalize=True)
Out[3]:
label
A 0.3448
B 0.2198
C 0.1356
D 0.0952
E 0.0488
F 0.0322
G 0.0284
H 0.0234
I 0.0168
J 0.0162
K 0.0146
L 0.0140
M 0.0090
N 0.0012
dtype: float64
Alternatively, you could simply skip the calculation of the frequencies altogether - pandas will do this for you:
In [6]: import numpy as np
In [7]: df = pd.DataFrame(np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]), columns=["label"])
In [8]: df.label.sample(5000, replace=True).value_counts(normalize=True)
Out[8]:
A 0.5994
B 0.2930
C 0.0576
D 0.0500
Name: label, dtype: float64
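The same idea works directly on the dataframe from your question (a quick sketch, assuming the id/label frame built by your setup code): just leave out the weights argument and the label shares in the sample will track those in the full frame.
# unweighted sample of the question's id/label frame; no 'weights' argument
sampledf = df.sample(n=5000, random_state=42)
sampledf['label'].value_counts(normalize=True)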
The issue with the code in your question is that you end up weighting by frequency twice: each label already appears in the dataframe in proportion to its frequency, and then you pass those same frequencies again as explicit weights:
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.choice(["A", "B", "C", "D"], size=20_000, p=[0.6, 0.3, 0.05, 0.05]), columns=["label"])
In [3]: df['frequency'] = df.groupby('label')['label'].transform('count')
In [4]: df
Out[4]:
label frequency
0 A 11908
1 A 11908
2 B 5994
3 B 5994
4 D 1033
... ... ...
19995 A 11908
19996 D 1033
19997 A 11908
19998 A 11908
19999 A 11908
The result is roughly equal to the normalized square of each frequency:
In [6]: freqs = np.array([0.6, 0.3, 0.05, 0.05])
In [7]: (freqs ** 2) / (freqs ** 2).sum()
Out[7]:
array([0.79120879, 0.1978022 , 0.00549451, 0.00549451])
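As a sanity check (a rough sketch, reusing the df and frequency column from the session above), sampling with those per-row frequencies as explicit weights should land near that squared distribution rather than the original 0.6/0.3/0.05/0.05:
# weighting by the per-row frequency double-counts it, just like the question's code;
# with replace=True the label shares should come out close to the
# array([0.79, 0.20, 0.005, 0.005]) computed above
df.label.sample(n=5000, weights=df.frequency, replace=True, random_state=42).value_counts(normalize=True)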
Upvotes: 3