Reputation: 29
I have df1 which has three columns (loadgroup, cartons, blocks) like this
loadgroup | cartons | blocks | cartonsPercent | blocksPercent |
---|---|---|---|---|
1 | 2269 | 14 | 26% | 21% |
2 | 1168 | 13 | 13% | 19% |
3 | 937 | 8 | 11% | 12% |
4 | 2753 | 24 | 31% | 35% |
5 | 1686 | 9 | 19% | 13% |
total(sum of column) | 8813 | 68 | 100% | 100% |
The interpretation is like this: out of df1 26% cartons which is also 21% of blocks are assigned to loadgroup 1, etc. we can assume blocks are 1 to 68, cartons are 1 to 8813. I also have df2 which also has cartons and blocks columns. but does not have loadgroup. My goal is to assign loadgroup (1-5 as well) to df2 (100 blocks 29608 cartons in total), but keep the proportions, for example, for df2, 26% cartons 21% blocks assign loadgroup 1, 13% cartons 19% blocks assign loadgroup 2, etc. df2 is like this:
block | cartons |
---|---|
0 | 533 |
1 | 257 |
2 | 96 |
3 | 104 |
4 | 130 |
5 | 71 |
6 | 68 |
7 | 87 |
8 | 99 |
9 | 51 |
10 | 291 |
11 | 119 |
12 | 274 |
13 | 316 |
14 | 87 |
15 | 149 |
16 | 120 |
17 | 222 |
18 | 100 |
19 | 148 |
20 | 192 |
21 | 188 |
22 | 293 |
23 | 120 |
24 | 224 |
25 | 449 |
26 | 385 |
27 | 395 |
28 | 418 |
29 | 423 |
30 | 244 |
31 | 327 |
32 | 337 |
33 | 249 |
34 | 528 |
35 | 528 |
36 | 494 |
37 | 540 |
38 | 368 |
39 | 533 |
40 | 614 |
41 | 462 |
42 | 350 |
43 | 618 |
44 | 463 |
45 | 552 |
46 | 397 |
47 | 401 |
48 | 397 |
49 | 365 |
50 | 475 |
51 | 379 |
52 | 541 |
53 | 488 |
54 | 383 |
55 | 354 |
56 | 760 |
57 | 327 |
58 | 211 |
59 | 356 |
60 | 552 |
61 | 401 |
62 | 320 |
63 | 368 |
64 | 311 |
65 | 421 |
66 | 458 |
67 | 278 |
68 | 504 |
69 | 385 |
70 | 242 |
71 | 413 |
72 | 246 |
73 | 465 |
74 | 386 |
75 | 231 |
76 | 154 |
77 | 294 |
78 | 275 |
79 | 169 |
80 | 398 |
81 | 227 |
82 | 273 |
83 | 319 |
84 | 177 |
85 | 272 |
86 | 204 |
87 | 139 |
88 | 187 |
89 | 263 |
90 | 90 |
91 | 134 |
92 | 67 |
93 | 115 |
94 | 45 |
95 | 65 |
96 | 40 |
97 | 108 |
98 | 60 |
99 | 102 |
total 100 blocks | 29608 cartons |
I want to add loadgroup column to df2, try to keep those proportions as close as possible. How to do it please? Thank you very much for the help.
Upvotes: -1
Views: 296
Reputation: 29
I don't know how to find loadgroup column based on both cartons percent and blocks percent. But generate random loadgroup based on either cartons percent or blocks percent is easy.
Here is what I did. I generate 100,000 seeds first, then for each seed, I add column loadgroup1 based on cartons percent, loadgroup2 based on blocks percent, then calculate both percentages, then compare with df1 percentages, get absolute difference, record it. For these 100,000 seeds, I take the minimum difference one as my solution, which is sufficient for my job.
But this is not the optimal solution, and I am looking for quick and easy way to do this. Hope somebody can help. Here is my code.
df = pd.DataFrame()
np.random.seed(10000)
seeds = np.random.randint(1, 1000000, size = 100000)
for i in range(46530, 46537):
print(seeds[i])
np.random.seed(seeds[i])
df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p = df1.CartonsPercent)
df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p = df1.blocksPercent)
df2.reset_index(inplace = True)
three = pd.DataFrame(df2.groupby('loadGroup1').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
three['blocksPercent'] = three.blocks/three.blocks.sum()
four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
four = four.abs()
subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['CartonsPercent'], 'AbsDiff':[four.sum().sum()]})
df = pd.concat([df,subdf])
three = pd.DataFrame(df2.groupby('loadGroup2').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
three['blocksPercent'] = three.blocks/three.blocks.sum()
four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
four = four.abs()
subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['blocksPercent'], 'AbsDiff':[four.sum().sum()]})
df = pd.concat([df,subdf])
df.sort_values(by = 'AbsDiff', ascending = True, inplace = True)
df = df.head(10)
Actually the first row of df will tell me the seed I am looking for, I kept 10 rows just for curiosity.
Here is my solution.
block | cartons | loadgroup |
---|---|---|
0 | 533 | 4 |
1 | 257 | 1 |
2 | 96 | 4 |
3 | 104 | 4 |
4 | 130 | 4 |
5 | 71 | 2 |
6 | 68 | 1 |
7 | 87 | 4 |
8 | 99 | 4 |
9 | 51 | 4 |
10 | 291 | 4 |
11 | 119 | 2 |
12 | 274 | 2 |
13 | 316 | 4 |
14 | 87 | 4 |
15 | 149 | 5 |
16 | 120 | 3 |
17 | 222 | 2 |
18 | 100 | 2 |
19 | 148 | 2 |
20 | 192 | 3 |
21 | 188 | 4 |
22 | 293 | 1 |
23 | 120 | 2 |
24 | 224 | 4 |
25 | 449 | 1 |
26 | 385 | 5 |
27 | 395 | 3 |
28 | 418 | 1 |
29 | 423 | 4 |
30 | 244 | 5 |
31 | 327 | 1 |
32 | 337 | 5 |
33 | 249 | 4 |
34 | 528 | 1 |
35 | 528 | 1 |
36 | 494 | 5 |
37 | 540 | 3 |
38 | 368 | 2 |
39 | 533 | 4 |
40 | 614 | 5 |
41 | 462 | 4 |
42 | 350 | 5 |
43 | 618 | 4 |
44 | 463 | 2 |
45 | 552 | 1 |
46 | 397 | 3 |
47 | 401 | 3 |
48 | 397 | 1 |
49 | 365 | 1 |
50 | 475 | 4 |
51 | 379 | 1 |
52 | 541 | 1 |
53 | 488 | 2 |
54 | 383 | 2 |
55 | 354 | 1 |
56 | 760 | 5 |
57 | 327 | 4 |
58 | 211 | 2 |
59 | 356 | 5 |
60 | 552 | 4 |
61 | 401 | 1 |
62 | 320 | 1 |
63 | 368 | 3 |
64 | 311 | 3 |
65 | 421 | 2 |
66 | 458 | 5 |
67 | 278 | 4 |
68 | 504 | 5 |
69 | 385 | 4 |
70 | 242 | 4 |
71 | 413 | 1 |
72 | 246 | 2 |
73 | 465 | 5 |
74 | 386 | 4 |
75 | 231 | 1 |
76 | 154 | 4 |
77 | 294 | 4 |
78 | 275 | 1 |
79 | 169 | 4 |
80 | 398 | 4 |
81 | 227 | 4 |
82 | 273 | 1 |
83 | 319 | 3 |
84 | 177 | 4 |
85 | 272 | 5 |
86 | 204 | 3 |
87 | 139 | 1 |
88 | 187 | 4 |
89 | 263 | 4 |
90 | 90 | 4 |
91 | 134 | 4 |
92 | 67 | 3 |
93 | 115 | 3 |
94 | 45 | 2 |
95 | 65 | 2 |
96 | 40 | 4 |
97 | 108 | 2 |
98 | 60 | 2 |
99 | 102 | 1 |
Here are the summaries.
loadgroup | cartons | blocks | cartonsPercent | blocksPercent |
---|---|---|---|---|
1 | 7610 | 22 | 26% | 22% |
2 | 3912 | 18 | 13% | 18% |
3 | 3429 | 12 | 12% | 12% |
4 | 9269 | 35 | 31% | 35% |
5 | 5388 | 13 | 18% | 13% |
It's very close to my target though.
Upvotes: 0