Kiran

Reputation: 159

How to apply random sampling using multiple selection criteria in Pandas?

My dataset contains the ID, gender, village and crop area (in hectares) of different farmers.

I have to split the farmers into groups of about 5 hectares each, based on crop area. In each group, four farmers are to be selected at random, but the random selection must include at least 25% of the women farmers' crop area in that group.

I have been trying to work out how to do this, but I am stuck on getting a correct solution.

Here is my dummy data set:

    Farmer_id Gender Village  Crop_area
0           1      F  Nashik       1.00
1           2      F  Nashik       0.50
2           3      M  Nashik       1.00
3           4      M  Nashik       0.80
4           5      M  Nashik       0.60
5           6      M  Nashik       0.10
6           7      M  Nashik       1.00
7           8      F  Nashik       0.60
8           9      F  Nashik       1.00
9          10      F  Nashik       0.29
10         11      M  Nashik       0.70
11         12      M  Nashik       1.00
12         13      M  Nashik       0.41
13         14      M  Nashik       1.00 

Here is what I have so far:

df['Crop_Area_Cum'] = df['Crop_area'].cumsum()
grouped = df.groupby(df.Gender)
df_male = grouped.get_group("M")
df_female = grouped.get_group("F")
df['Sample']=4
df['Selected Farmers'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected Farmers'] = df.groupby('Gender').apply(lambda df: df['Village'].sample(df['Selected Farmers'].iat[0])).reset_index(level=0)['Village']
df['Selected Farmers'] = df['Selected Farmers'].fillna('')

    Farmer_id Gender        ...        Sample  Selected Farmers
0           1      F        ...             4            Nashik
1           2      F        ...             4            Nashik
2           3      M        ...             4                  
3           4      M        ...             4                  
4           5      M        ...             4                  
5           6      M        ...             4            Nashik
6           7      M        ...             4            Nashik
7           8      F        ...             4            Nashik
8           9      F        ...             4            Nashik
9          10      F        ...             4                  
10         11      M        ...             4                  
11         12      M        ...             4                  
12         13      M        ...             4            Nashik
13         14      M        ...             4            Nashik

This output is not correct, because the sampling follows none of the criteria.

Actual output required: (image not available)

Upvotes: 1

Views: 578

Answers (2)

sitting_duck

Reputation: 3720

The first idea is to assign a random group number to each row, then sum the crop area per group to see whether each group's total is close to 5. Keep doing that until the condition is true. Random selection flags are then assigned to each row such that each group has 4 Trues and the rest Falses.

The next step is to keep assigning those random selection flags until, in each group, the selected female(s) represent at least 25% of the total female land ownership. Keep running the code until a solution is achieved.

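import random

import numpy as np
import pandas as pd

# the dummy data totals 10 ha, so two groups of about 5 ha each
nb_groups = 2
# minimum share of a group's female crop area that must be selected
min_acceptable_female_prp = 0.25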

The random selector function:

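def rnd_sel(x):
    # Return shuffled flags for one group: exactly 4 True, the rest False.
    # Assumes the group has at least 4 rows.
    arr = [True]*4+[False]*(len(x)-4)
    np.random.shuffle(arr)
    return arr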

The main processing loop:

for i in range(100):
    # randomly assign every farmer to one of the groups
    dfg = df.assign(Group=random.choices(range(1,nb_groups+1), k=len(df)))

    # continue only if every group's total crop area is within 0.5 ha of 5 ha
    if (dfg.groupby('Group')['Crop_area'].sum()-5).abs().max() < 0.5:
        dfg.sort_values(['Group','Gender'], inplace=True)

        print(f'\nGroup area solve iteration: {i}\n')
        dfg_s = dfg.groupby(['Group','Gender']).sum(numeric_only=True)
        dfg_s['Tot_Grp_Crop_area'] = dfg_s.groupby('Group')['Crop_area'].transform('sum')
        print(dfg_s)

        # make sure females are present in each group
        if (dfg_s.index.get_level_values('Gender') == 'F').sum() == nb_groups:
            # flag 4 random farmers per group as selected
            dfg['Selected'] = dfg.groupby('Group')['Group'].transform(rnd_sel)

            print()
            print(dfg)

            dfg_sf = dfg[dfg['Selected']].groupby(['Group','Gender']).sum(numeric_only=True)
            print(dfg_sf)

            # make sure at least one female was selected in each group
            if (dfg_sf.index.get_level_values('Gender') == 'F').sum() == nb_groups:
                dfg_s['Gender_Selected_area'] = dfg_sf['Crop_area']

                # share of each gender's crop area within the group that was selected
                dfg_s['Gender_Selected_area_prp'] = dfg_s['Gender_Selected_area']/dfg_s['Crop_area']
                print(dfg_s)

                min_female_prp = dfg_s.loc[pd.IndexSlice[:, 'F'], :]['Gender_Selected_area_prp'].min()

                if min_female_prp >= min_acceptable_female_prp:
                    print(f'\nSolution achieved with minimum female crop area representation of {min_female_prp*100:.1f}%')
                    break  # stop at the first accepted solution
                else:
                    print('*** solution not achieved')

Associated output:

Group area solve iteration: 0

              Farmer_id  Crop_area  Tot_Grp_Crop_area
Group Gender                                         
1     F              11       2.10               5.41
      M              38       3.31               5.41
2     F              19       1.29               4.59
      M              37       3.30               4.59

    Farmer_id Gender Village  Crop_area  Group  Selected
0           1      F  Nashik       1.00      1     False
1           2      F  Nashik       0.50      1      True
7           8      F  Nashik       0.60      1      True
2           3      M  Nashik       1.00      1      True
3           4      M  Nashik       0.80      1      True
5           6      M  Nashik       0.10      1     False
11         12      M  Nashik       1.00      1     False
12         13      M  Nashik       0.41      1     False
8           9      F  Nashik       1.00      2      True
9          10      F  Nashik       0.29      2     False
4           5      M  Nashik       0.60      2      True
6           7      M  Nashik       1.00      2     False
10         11      M  Nashik       0.70      2      True
13         14      M  Nashik       1.00      2      True
              Farmer_id  Crop_area  Selected
Group Gender                                
1     F              10        1.1         2
      M               7        1.8         2
2     F               9        1.0         1
      M              30        2.3         3
              Farmer_id  Crop_area  Tot_Grp_Crop_area  Gender_Selected_area  \
Group Gender                                                                  
1     F              11       2.10               5.41                   1.1   
      M              38       3.31               5.41                   1.8   
2     F              19       1.29               4.59                   1.0   
      M              37       3.30               4.59                   2.3   

              Gender_Selected_area_prp  
Group Gender                            
1     F                       0.523810  
      M                       0.543807  
2     F                       0.775194  
      M                       0.696970  

Solution achieved with minimum female crop area representation of 52.4%
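
As a sanity check, an accepted assignment can be verified independently of the loop. The following is only a sketch: `check_solution` and its parameter names are hypothetical, and it assumes the 25% rule means "selected female area divided by total female area within each group", which is the ratio computed above. The rows are copied from the printed output.

```python
import pandas as pd

# Accepted assignment copied from the printed output:
# (Farmer_id, Gender, Crop_area, Group, Selected)
rows = [
    (1, "F", 1.00, 1, False), (2, "F", 0.50, 1, True), (8, "F", 0.60, 1, True),
    (3, "M", 1.00, 1, True), (4, "M", 0.80, 1, True), (6, "M", 0.10, 1, False),
    (12, "M", 1.00, 1, False), (13, "M", 0.41, 1, False),
    (9, "F", 1.00, 2, True), (10, "F", 0.29, 2, False), (5, "M", 0.60, 2, True),
    (7, "M", 1.00, 2, False), (11, "M", 0.70, 2, True), (14, "M", 1.00, 2, True),
]
result = pd.DataFrame(
    rows, columns=["Farmer_id", "Gender", "Crop_area", "Group", "Selected"]
)


def check_solution(dfg, target_area=5, tol=0.5, n_selected=4, min_female_prp=0.25):
    """Hypothetical helper: one row per group, one boolean column per criterion."""
    group_area = dfg.groupby("Group")["Crop_area"].sum()
    n_sel = dfg.groupby("Group")["Selected"].sum()

    females = dfg[dfg["Gender"] == "F"]
    female_total = females.groupby("Group")["Crop_area"].sum()
    female_sel = females[females["Selected"]].groupby("Group")["Crop_area"].sum()
    # Groups with no females, or no selected females, count as a share of 0.
    female_prp = (female_sel / female_total).reindex(group_area.index).fillna(0)

    return pd.DataFrame({
        "area_ok": (group_area - target_area).abs() < tol,
        "count_ok": n_sel == n_selected,
        "female_ok": female_prp >= min_female_prp,
    })


report = check_solution(result)
print(report)
```

Every cell of `report` should be `True` for an accepted solution; a `False` points at the group and criterion that failed.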

Upvotes: 1

J_H

Reputation: 20505

There's more than one way to tackle this.

Rejection Sampling is perhaps the simplest.

Your new function should return a suitable 4-tuple of farmers.

  1. Create an empty set().
  2. While length of set < 4:
    • Keep choosing a random id, and add it to the set.
  3. Now you have 4 candidate farmers. Compute summary statistics on gender and area.
  4. Decide whether these four are acceptable or should be rejected. Either return the four, or start again from step 1.
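
The steps above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `sample_four`, `min_female_share`, `max_tries` and `rng` are hypothetical names, and the acceptance rule (women hold at least 25% of the selected crop area) is only one possible reading of the requirement. The DataFrame reproduces the dummy data from the question.

```python
import random

import pandas as pd

# Dummy data from the question.
df = pd.DataFrame({
    "Farmer_id": range(1, 15),
    "Gender": list("FFMMMMMFFFMMMM"),
    "Village": ["Nashik"] * 14,
    "Crop_area": [1.00, 0.50, 1.00, 0.80, 0.60, 0.10, 1.00,
                  0.60, 1.00, 0.29, 0.70, 1.00, 0.41, 1.00],
})


def sample_four(df, min_female_share=0.25, max_tries=1000, rng=None):
    """Rejection-sample four farmers (hypothetical helper).

    Assumption: a draw is accepted when women hold at least
    `min_female_share` of the selected crop area.
    """
    rng = rng or random.Random()
    ids = df["Farmer_id"].tolist()
    if len(set(ids)) < 4:
        raise ValueError("need at least four distinct farmers")

    for _ in range(max_tries):
        # Steps 1-2: add random ids until the set holds four distinct farmers.
        chosen = set()
        while len(chosen) < 4:
            chosen.add(rng.choice(ids))

        # Step 3: summary statistics on gender and area.
        picked = df[df["Farmer_id"].isin(chosen)]
        total_area = picked["Crop_area"].sum()
        female_area = picked.loc[picked["Gender"] == "F", "Crop_area"].sum()

        # Step 4: accept the draw, or reject it and start over.
        if female_area >= min_female_share * total_area:
            return picked

    raise RuntimeError("no acceptable sample found within max_tries")


picked = sample_four(df, rng=random.Random(0))
print(picked)
```

The `max_tries` cap is there so that criteria which can never be met raise an error instead of looping forever.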

Upvotes: 2
