Reputation: 93
I'm working with the following panel data in a MultiIndex
pandas DataFrame
called df_data
:
y x
n time
0 0 0.423607 -0.307983
1 0.565563 -0.333430
2 0.735979 -0.453137
3 0.962857 1.671106
1 0 0.772304 1.221366
1 0.455327 -1.024852
2 0.864768 0.609867
3 0.334429 -2.567936
2 0 0.435553 -0.259228
1 0.221501 0.484677
2 0.773628 0.650288
3 0.293902 0.566452
n
indexes an individual (there are 500 of them), t
indexes time. It's a balanced panel. I would like to create a random sample of nn=100
individuals with replacement. Also, if an individual makes it into the random sample, then all 4 time observations (t=0,1,2,3) for this individual should be assigned to the sample.
The following line is doing almost what I want:
df_sample = df_data.loc[np.random.randint(3, size=100).tolist()]
However, it does not sample an individual repeatedly. So if the created list of random variables is say [2, 3, 2, 4, 1, ...] then the third individual (index =2 is the third individual) is only selected once and not twice into the random sample. This means that as soon as the random vector above contains the same individual more than once, I end up with fewer than 100 individuals (with 4 time observations each) in the random sample.
I also tried the df_data.sample
function, but I doesn't seem to be able to handle the specific multilevel index I have here in the panel.
I could write all kinds of loops to get this done, but I thought there should be a simpler (and faster) way of doing this.
I'm on Python 3.5 and I'm using pandas version 0.17.1.
Thanks.
Upvotes: 9
Views: 3682
Reputation: 109
This worked for me, it sort of a compination fo the other answers:
subindex = df.index.get_level_values('id')
sample_ids = np.random.choice(subindex, 5, replace=False)
sample = df.loc[sample_ids]
I was using a df with an index of ["id", "other"]. It returned a sample of 5 id's and all their associated "others".
Upvotes: 3
Reputation:
A simple solution:
subindex = df.index.get_level_values('sub_index')
sample_ids = np.random.choice(subindex, 5, replace=True)
sample = df[subindex.isin(sample_ids)].copy()
Upvotes: 5
Reputation: 42875
You can use itertools.product
to quickly produce the format needed to select with duplicates from a MultiIndex
:
Sample data:
from itertools import product
individuals = list(range(500))
time = (0, 1, 2, 3,)
index = pd.MultiIndex.from_tuples(list(product(individuals, time)))
df = pd.DataFrame(data={'A': np.random.random(size=2000), 'B': np.random.random(size=2000)}, index=index)
A B
0 0 0.208461 0.842118
1 0.481681 0.096121
2 0.420538 0.922363
3 0.859182 0.078940
1 0 0.171162 0.255883
1 0.338864 0.975492
2 0.270533 0.504605
3 0.691041 0.709189
2 0 0.220405 0.925001
1 0.811951 0.479795
2 0.010527 0.534866
3 0.561204 0.915972
3 0 0.813726 0.083478
1 0.745100 0.462120
2 0.189111 0.552039
3 0.006141 0.622969
Combine the result of np.random.randint
with the time
values using product
:
sample_ix = np.random.randint(low=0, high=500, size=100)
len(np.unique(sample_ix))
91
sample_multi_ix = list(product(sample_ix, time))
[(55, 0), (55, 1), (55, 2), (55, 3), (254, 0), (254, 1), (254, 2), (254, 3), ...]
and select accordingly:
sample = df.loc[sample_multi_ix, :]
sample.info()
MultiIndex: 400 entries, (55, 0) to (135, 3)
Data columns (total 2 columns):
A 400 non-null float64
B 400 non-null float64
dtypes: float64(2)
memory usage: 9.4+ KB
If you want a unique sample
index
, you can add:
sample.index = pd.MultiIndex.from_tuples(list(product(list(range(100)), time)))
MultiIndex: 400 entries, (0, 0) to (99, 3)
Data columns (total 2 columns):
A 400 non-null float64
B 400 non-null float64
dtypes: float64(2)
Upvotes: 4