J Jung
J Jung

Reputation: 93

Sampling from MultiIndex DataFrame

I'm working with the following panel data in a MultiIndex pandas DataFrame called df_data:

              y         x
n   time                    
0   0     0.423607 -0.307983
    1     0.565563 -0.333430
    2     0.735979 -0.453137
    3     0.962857  1.671106
1   0     0.772304  1.221366
    1     0.455327 -1.024852
    2     0.864768  0.609867
    3     0.334429 -2.567936
2   0     0.435553 -0.259228
    1     0.221501  0.484677
    2     0.773628  0.650288
    3     0.293902  0.566452

n indexes an individual (there are 500 of them), t indexes time. It's a balanced panel. I would like to create a random sample of nn=100 individuals with replacement. Also, if an individual makes it into the random sample, then all 4 time observations (t=0,1,2,3) for this individual should be assigned to the sample.

The following line is doing almost what I want:

df_sample = df_data.loc[np.random.randint(3, size=100).tolist()]

However, it does not sample an individual repeatedly. So if the created list of random variables is say [2, 3, 2, 4, 1, ...] then the third individual (index =2 is the third individual) is only selected once and not twice into the random sample. This means that as soon as the random vector above contains the same individual more than once, I end up with fewer than 100 individuals (with 4 time observations each) in the random sample. I also tried the df_data.samplefunction, but I doesn't seem to be able to handle the specific multilevel index I have here in the panel. I could write all kinds of loops to get this done, but I thought there should be a simpler (and faster) way of doing this. I'm on Python 3.5 and I'm using pandas version 0.17.1. Thanks.

Upvotes: 9

Views: 3682

Answers (3)

VFR292
VFR292

Reputation: 109

This worked for me, it sort of a compination fo the other answers:

subindex = df.index.get_level_values('id')
sample_ids = np.random.choice(subindex, 5, replace=False)
sample = df.loc[sample_ids]

I was using a df with an index of ["id", "other"]. It returned a sample of 5 id's and all their associated "others".

Upvotes: 3

user267817
user267817

Reputation:

A simple solution:

subindex = df.index.get_level_values('sub_index')
sample_ids = np.random.choice(subindex, 5, replace=True)
sample = df[subindex.isin(sample_ids)].copy()

Upvotes: 5

Stefan
Stefan

Reputation: 42875

You can use itertools.product to quickly produce the format needed to select with duplicates from a MultiIndex:

Sample data:

from itertools import product
individuals = list(range(500))
time = (0, 1, 2, 3,)
index = pd.MultiIndex.from_tuples(list(product(individuals, time)))
df = pd.DataFrame(data={'A': np.random.random(size=2000), 'B': np.random.random(size=2000)}, index=index)

              A         B
0   0  0.208461  0.842118
    1  0.481681  0.096121
    2  0.420538  0.922363
    3  0.859182  0.078940
1   0  0.171162  0.255883
    1  0.338864  0.975492
    2  0.270533  0.504605
    3  0.691041  0.709189
2   0  0.220405  0.925001
    1  0.811951  0.479795
    2  0.010527  0.534866
    3  0.561204  0.915972
3   0  0.813726  0.083478
    1  0.745100  0.462120
    2  0.189111  0.552039
    3  0.006141  0.622969

Combine the result of np.random.randint with the time values using product:

sample_ix = np.random.randint(low=0, high=500, size=100)

len(np.unique(sample_ix))

91

sample_multi_ix = list(product(sample_ix, time))

[(55, 0), (55, 1), (55, 2), (55, 3), (254, 0), (254, 1), (254, 2), (254, 3), ...]

and select accordingly:

sample = df.loc[sample_multi_ix, :]
sample.info()

MultiIndex: 400 entries, (55, 0) to (135, 3)
Data columns (total 2 columns):
A    400 non-null float64
B    400 non-null float64
dtypes: float64(2)
memory usage: 9.4+ KB

If you want a unique sample index, you can add:

sample.index = pd.MultiIndex.from_tuples(list(product(list(range(100)), time))) 

MultiIndex: 400 entries, (0, 0) to (99, 3)
Data columns (total 2 columns):
A    400 non-null float64
B    400 non-null float64
dtypes: float64(2)

Upvotes: 4

Related Questions