Danish

Reputation: 2871

Random Data generation based on condition in pandas and numpy

I have a data frame as shown below.

ID      
1    
2    
3    
4    
5    
6    
7    
8    
9    
10   
11   
12   
13   
14   
15   
16  
17   
18   
19   
20
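
For anyone who wants to reproduce this, the frame above could be built along these lines. This is only a sketch: the variable name `df` is an assumption, chosen because it is the name passed to `myfunc` in the answer.

```python
import pandas as pd

# One column, ID, holding the 20 unique values 1..20
df = pd.DataFrame({"ID": range(1, 21)})
print(df.shape)
```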

It has a single column, ID, with 20 unique values. I want to randomly pick 25% of the unique ID values (5 of them) and create a new column, OWNER_ID, by randomly assigning those picked values across the 20 rows, with 10% of the rows (2 rows) left missing.

The randomly picked IDs and OWNER_ID should match. For example, if 2 is picked as one of the owner IDs, then wherever ID is 2, OWNER_ID should also be 2.

For example, suppose I randomly picked 2, 3, 8, 9, and 11.

The expected output:

ID   OWNER_ID
1    2
2    2
3    3
4    11
5    9
6    11
7    11
8    8
9    9
10   2
11   11
12   2
13   na
14   8
15   9
16   8
17   9
18   2
19   2
20   na

I just don't know how to start on this, so I have not tried anything yet. I am just learning random data generation with pandas.

Upvotes: 1

Views: 108

Answers (1)

anky

Reputation: 75100

Maybe you can try a custom function like this:

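import numpy as np
import pandas as pd

def myfunc(d):
    # pick 25% of the IDs as owners; each one stays on its own row
    s = d['ID'].sample(frac=.25)
    d = d.assign(owner_id=s)
    # fill the remaining rows with random draws from the picked owners
    fill_na = pd.Series(np.random.choice(d['owner_id'].dropna(), size=len(d)), index=d.index)  # thanks @jezrael
    d['owner_id'] = d['owner_id'].fillna(fill_na)
    # blank out a random 10% of the rows
    d.loc[d.sample(frac=.10).index, 'owner_id'] = np.nan
    return d
myfunc(df)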

    ID  owner_id
0    1       3.0
1    2      19.0
2    3       3.0
3    4       3.0
4    5       5.0
5    6       3.0
6    7       8.0
7    8       8.0
8    9       NaN
9   10       3.0
10  11       5.0
11  12       3.0
12  13      19.0
13  14       9.0
14  15       5.0
15  16      19.0
16  17       NaN
17  18       9.0
18  19      19.0
19  20       9.0
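
The owner IDs show up as floats because NaN forces the column to a float dtype. If you would rather keep integers and get repeatable results, here is a rough sketch of the same idea using a seeded NumPy generator and pandas' nullable `Int64` dtype. The function name `make_owner_ids` and its parameters `owner_frac`, `missing_frac` and `seed` are made up for illustration, and the column name `OWNER_ID` is taken from the question.

```python
import numpy as np
import pandas as pd

def make_owner_ids(d, owner_frac=0.25, missing_frac=0.10, seed=0):
    # hypothetical helper: the name and parameters are illustrative only
    rng = np.random.default_rng(seed)
    ids = d["ID"].to_numpy()
    # pick the owners without replacement (5 of 20 for owner_frac=0.25)
    owners = rng.choice(ids, size=round(len(ids) * owner_frac), replace=False)
    # start with a random owner for every row
    owner_id = pd.Series(rng.choice(owners, size=len(d)), index=d.index)
    # rows whose ID is itself an owner must point at themselves
    is_owner = d["ID"].isin(owners)
    owner_id.loc[is_owner] = d.loc[is_owner, "ID"]
    # nullable integer dtype keeps the values as integers next to missing ones
    owner_id = owner_id.astype("Int64")
    # blank out a random share of the rows (2 of 20 for missing_frac=0.10)
    missing = rng.choice(d.index.to_numpy(), size=round(len(d) * missing_frac), replace=False)
    owner_id.loc[missing] = pd.NA
    return d.assign(OWNER_ID=owner_id)

df = pd.DataFrame({"ID": range(1, 21)})
out = make_owner_ids(df)
print(out)
```

As in the function above, a row whose ID was picked as an owner can still end up among the missing rows. If that is not wanted, the missing rows could be drawn only from the non-owner rows.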

Upvotes: 1
