Lenny
Lenny

Reputation: 29

pandas df add new column based on proportion of two other columns from another dataframe

I have df1 which has three columns (loadgroup, cartons, blocks) like this

loadgroup cartons blocks cartonsPercent blocksPercent
1 2269 14 26% 21%
2 1168 13 13% 19%
3 937 8 11% 12%
4 2753 24 31% 35%
5 1686 9 19% 13%
total(sum of column) 8813 68 100% 100%

The interpretation is like this: out of df1 26% cartons which is also 21% of blocks are assigned to loadgroup 1, etc. we can assume blocks are 1 to 68, cartons are 1 to 8813. I also have df2 which also has cartons and blocks columns. but does not have loadgroup. My goal is to assign loadgroup (1-5 as well) to df2 (100 blocks 29608 cartons in total), but keep the proportions, for example, for df2, 26% cartons 21% blocks assign loadgroup 1, 13% cartons 19% blocks assign loadgroup 2, etc. df2 is like this:

block cartons
0 533
1 257
2 96
3 104
4 130
5 71
6 68
7 87
8 99
9 51
10 291
11 119
12 274
13 316
14 87
15 149
16 120
17 222
18 100
19 148
20 192
21 188
22 293
23 120
24 224
25 449
26 385
27 395
28 418
29 423
30 244
31 327
32 337
33 249
34 528
35 528
36 494
37 540
38 368
39 533
40 614
41 462
42 350
43 618
44 463
45 552
46 397
47 401
48 397
49 365
50 475
51 379
52 541
53 488
54 383
55 354
56 760
57 327
58 211
59 356
60 552
61 401
62 320
63 368
64 311
65 421
66 458
67 278
68 504
69 385
70 242
71 413
72 246
73 465
74 386
75 231
76 154
77 294
78 275
79 169
80 398
81 227
82 273
83 319
84 177
85 272
86 204
87 139
88 187
89 263
90 90
91 134
92 67
93 115
94 45
95 65
96 40
97 108
98 60
99 102
total 100 blocks 29608 cartons

I want to add loadgroup column to df2, try to keep those proportions as close as possible. How to do it please? Thank you very much for the help.

Upvotes: -1

Views: 296

Answers (1)

Lenny
Lenny

Reputation: 29

I don't know how to find loadgroup column based on both cartons percent and blocks percent. But generate random loadgroup based on either cartons percent or blocks percent is easy.

Here is what I did. I generate 100,000 seeds first, then for each seed, I add column loadgroup1 based on cartons percent, loadgroup2 based on blocks percent, then calculate both percentages, then compare with df1 percentages, get absolute difference, record it. For these 100,000 seeds, I take the minimum difference one as my solution, which is sufficient for my job.

But this is not the optimal solution, and I am looking for quick and easy way to do this. Hope somebody can help. Here is my code.

df = pd.DataFrame()    
np.random.seed(10000)    
seeds = np.random.randint(1, 1000000, size = 100000)    
for i in range(46530, 46537):
print(seeds[i])
np.random.seed(seeds[i])
df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p = df1.CartonsPercent)
df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p = df1.blocksPercent)

df2.reset_index(inplace = True)
three = pd.DataFrame(df2.groupby('loadGroup1').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
three['blocksPercent'] = three.blocks/three.blocks.sum()
four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
four = four.abs()

subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['CartonsPercent'], 'AbsDiff':[four.sum().sum()]})
df = pd.concat([df,subdf])

three = pd.DataFrame(df2.groupby('loadGroup2').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
three['blocksPercent'] = three.blocks/three.blocks.sum()
four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
four = four.abs()

subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['blocksPercent'], 'AbsDiff':[four.sum().sum()]})
df = pd.concat([df,subdf])
df.sort_values(by = 'AbsDiff', ascending = True, inplace = True)
df = df.head(10)

Actually the first row of df will tell me the seed I am looking for, I kept 10 rows just for curiosity.

Here is my solution.

block cartons loadgroup
0 533 4
1 257 1
2 96 4
3 104 4
4 130 4
5 71 2
6 68 1
7 87 4
8 99 4
9 51 4
10 291 4
11 119 2
12 274 2
13 316 4
14 87 4
15 149 5
16 120 3
17 222 2
18 100 2
19 148 2
20 192 3
21 188 4
22 293 1
23 120 2
24 224 4
25 449 1
26 385 5
27 395 3
28 418 1
29 423 4
30 244 5
31 327 1
32 337 5
33 249 4
34 528 1
35 528 1
36 494 5
37 540 3
38 368 2
39 533 4
40 614 5
41 462 4
42 350 5
43 618 4
44 463 2
45 552 1
46 397 3
47 401 3
48 397 1
49 365 1
50 475 4
51 379 1
52 541 1
53 488 2
54 383 2
55 354 1
56 760 5
57 327 4
58 211 2
59 356 5
60 552 4
61 401 1
62 320 1
63 368 3
64 311 3
65 421 2
66 458 5
67 278 4
68 504 5
69 385 4
70 242 4
71 413 1
72 246 2
73 465 5
74 386 4
75 231 1
76 154 4
77 294 4
78 275 1
79 169 4
80 398 4
81 227 4
82 273 1
83 319 3
84 177 4
85 272 5
86 204 3
87 139 1
88 187 4
89 263 4
90 90 4
91 134 4
92 67 3
93 115 3
94 45 2
95 65 2
96 40 4
97 108 2
98 60 2
99 102 1

Here are the summaries.

loadgroup cartons blocks cartonsPercent blocksPercent
1 7610 22 26% 22%
2 3912 18 13% 18%
3 3429 12 12% 12%
4 9269 35 31% 35%
5 5388 13 18% 13%

It's very close to my target though.

Upvotes: 0

Related Questions