I have df1 which has three columns (loadgroup, cartons, blocks) like this loadgroup cartons blocks cartonsPercent blocksPercent 1 2269 14 26% 21% 2 1168 13 13% 19% 3 937 8 11% 12% 4 2753 24 31% 35% 5 1686 9 19% 13% total(sum of column) 8813 68 100% 100% The interpretation is like this: out of df1 26% cartons which is also 21% of blocks are assigned to loadgroup 1, etc. we can assume blocks are 1 to 68, cartons are 1 to 8813. I also have df2 which also has cartons and blocks columns. but does not have loadgroup. My goal is to assign loadgroup (1-5 as well) to df2 (100 blocks 29608 cartons in total), but keep the proportions, for example, for df2, 26% cartons 21% blocks assign loadgroup 1, 13% cartons 19% blocks assign loadgroup 2, etc. df2 is like this: block cartons 0 533 1 257 2 96 3 104 4 130 5 71 6 68 7 87 8 99 9 51 10 291 11 119 12 274 13 316 14 87 15 149 16 120 17 222 18 100 19 148 20 192 21 188 22 293 23 120 24 224 25 449 26 385 27 395 28 418 29 423 30 244 31 327 32 337 33 249 34 528 35 528 36 494 37 540 38 368 39 533 40 614 41 462 42 350 43 618 44 463 45 552 46 397 47 401 48 397 49 365 50 475 51 379 52 541 53 488 54 383 55 354 56 760 57 327 58 211 59 356 60 552 61 401 62 320 63 368 64 311 65 421 66 458 67 278 68 504 69 385 70 242 71 413 72 246 73 465 74 386 75 231 76 154 77 294 78 275 79 169 80 398 81 227 82 273 83 319 84 177 85 272 86 204 87 139 88 187 89 263 90 90 91 134 92 67 93 115 94 45 95 65 96 40 97 108 98 60 99 102 total 100 blocks 29608 cartons I want to add loadgroup column to df2, try to keep those proportions as close as possible. How to do it please? Thank you very much for the help.

pandasdataframemultiple-columnssample-dataproportions

Reputation: 29

pandas df add new column based on proportion of two other columns from another dataframe

I have df1 which has three columns (loadgroup, cartons, blocks) like this

loadgroup	cartons	blocks	cartonsPercent	blocksPercent
1	2269	14	26%	21%
2	1168	13	13%	19%
3	937	8	11%	12%
4	2753	24	31%	35%
5	1686	9	19%	13%
total(sum of column)	8813	68	100%	100%

The interpretation is like this: out of df1 26% cartons which is also 21% of blocks are assigned to loadgroup 1, etc. we can assume blocks are 1 to 68, cartons are 1 to 8813. I also have df2 which also has cartons and blocks columns. but does not have loadgroup. My goal is to assign loadgroup (1-5 as well) to df2 (100 blocks 29608 cartons in total), but keep the proportions, for example, for df2, 26% cartons 21% blocks assign loadgroup 1, 13% cartons 19% blocks assign loadgroup 2, etc. df2 is like this:

block	cartons
0	533
1	257
2	96
3	104
4	130
5	71
6	68
7	87
8	99
9	51
10	291
11	119
12	274
13	316
14	87
15	149
16	120
17	222
18	100
19	148
20	192
21	188
22	293
23	120
24	224
25	449
26	385
27	395
28	418
29	423
30	244
31	327
32	337
33	249
34	528
35	528
36	494
37	540
38	368
39	533
40	614
41	462
42	350
43	618
44	463
45	552
46	397
47	401
48	397
49	365
50	475
51	379
52	541
53	488
54	383
55	354
56	760
57	327
58	211
59	356
60	552
61	401
62	320
63	368
64	311
65	421
66	458
67	278
68	504
69	385
70	242
71	413
72	246
73	465
74	386
75	231
76	154
77	294
78	275
79	169
80	398
81	227
82	273
83	319
84	177
85	272
86	204
87	139
88	187
89	263
90	90
91	134
92	67
93	115
94	45
95	65
96	40
97	108
98	60
99	102
total 100 blocks	29608 cartons

I want to add loadgroup column to df2, try to keep those proportions as close as possible. How to do it please? Thank you very much for the help.

Upvotes: -1

Answers (1)

Lenny

Reputation: 29

I don't know how to find loadgroup column based on both cartons percent and blocks percent. But generate random loadgroup based on either cartons percent or blocks percent is easy.

Here is what I did. I generate 100,000 seeds first, then for each seed, I add column loadgroup1 based on cartons percent, loadgroup2 based on blocks percent, then calculate both percentages, then compare with df1 percentages, get absolute difference, record it. For these 100,000 seeds, I take the minimum difference one as my solution, which is sufficient for my job.

But this is not the optimal solution, and I am looking for quick and easy way to do this. Hope somebody can help. Here is my code.

df = pd.DataFrame()    
np.random.seed(10000)    
seeds = np.random.randint(1, 1000000, size = 100000)    
for i in range(46530, 46537):
print(seeds[i])
np.random.seed(seeds[i])
df2['loadGroup1'] = np.random.choice(df1.loadgroup, len(df2), p = df1.CartonsPercent)
df2['loadGroup2'] = np.random.choice(df1.loadgroup, len(df2), p = df1.blocksPercent)

df2.reset_index(inplace = True)
three = pd.DataFrame(df2.groupby('loadGroup1').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
three['blocksPercent'] = three.blocks/three.blocks.sum()
four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
four = four.abs()

subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['CartonsPercent'], 'AbsDiff':[four.sum().sum()]})
df = pd.concat([df,subdf])

three = pd.DataFrame(df2.groupby('loadGroup2').agg(Cartons = ('cartons', 'sum'), blocks = ('block', 'count')))
three['CartonsPercent'] = three.Cartons/three.Cartons.sum()
three['blocksPercent'] = three.blocks/three.blocks.sum()
four = df1[['CartonsPercent','blocksPercent']] - three[['CartonsPercent','blocksPercent']]
four = four.abs()

subdf = pd.DataFrame({'i':[i],'Seed':[seeds[i]], 'Percent':['blocksPercent'], 'AbsDiff':[four.sum().sum()]})
df = pd.concat([df,subdf])
df.sort_values(by = 'AbsDiff', ascending = True, inplace = True)
df = df.head(10)

Actually the first row of df will tell me the seed I am looking for, I kept 10 rows just for curiosity.

Here is my solution.

block	cartons	loadgroup
0	533	4
1	257	1
2	96	4
3	104	4
4	130	4
5	71	2
6	68	1
7	87	4
8	99	4
9	51	4
10	291	4
11	119	2
12	274	2
13	316	4
14	87	4
15	149	5
16	120	3
17	222	2
18	100	2
19	148	2
20	192	3
21	188	4
22	293	1
23	120	2
24	224	4
25	449	1
26	385	5
27	395	3
28	418	1
29	423	4
30	244	5
31	327	1
32	337	5
33	249	4
34	528	1
35	528	1
36	494	5
37	540	3
38	368	2
39	533	4
40	614	5
41	462	4
42	350	5
43	618	4
44	463	2
45	552	1
46	397	3
47	401	3
48	397	1
49	365	1
50	475	4
51	379	1
52	541	1
53	488	2
54	383	2
55	354	1
56	760	5
57	327	4
58	211	2
59	356	5
60	552	4
61	401	1
62	320	1
63	368	3
64	311	3
65	421	2
66	458	5
67	278	4
68	504	5
69	385	4
70	242	4
71	413	1
72	246	2
73	465	5
74	386	4
75	231	1
76	154	4
77	294	4
78	275	1
79	169	4
80	398	4
81	227	4
82	273	1
83	319	3
84	177	4
85	272	5
86	204	3
87	139	1
88	187	4
89	263	4
90	90	4
91	134	4
92	67	3
93	115	3
94	45	2
95	65	2
96	40	4
97	108	2
98	60	2
99	102	1

Here are the summaries.

loadgroup	cartons	blocks	cartonsPercent	blocksPercent
1	7610	22	26%	22%
2	3912	18	13%	18%
3	3429	12	12%	12%
4	9269	35	31%	35%
5	5388	13	18%	13%

It's very close to my target though.

Upvotes: 0

block	cartons	loadgroup
0	533	4
1	257	1
2	96	4
3	104	4
4	130	4
5	71	2
6	68	1
7	87	4
8	99	4
9	51	4
10	291	4
11	119	2
12	274	2
13	316	4
14	87	4
15	149	5
16	120	3
17	222	2
18	100	2
19	148	2
20	192	3
21	188	4
22	293	1
23	120	2
24	224	4
25	449	1
26	385	5
27	395	3
28	418	1
29	423	4
30	244	5
31	327	1
32	337	5
33	249	4
34	528	1
35	528	1
36	494	5
37	540	3
38	368	2
39	533	4
40	614	5
41	462	4
42	350	5
43	618	4
44	463	2
45	552	1
46	397	3
47	401	3
48	397	1
49	365	1
50	475	4
51	379	1
52	541	1
53	488	2
54	383	2
55	354	1
56	760	5
57	327	4
58	211	2
59	356	5
60	552	4
61	401	1
62	320	1
63	368	3
64	311	3
65	421	2
66	458	5
67	278	4
68	504	5
69	385	4
70	242	4
71	413	1
72	246	2
73	465	5
74	386	4
75	231	1
76	154	4
77	294	4
78	275	1
79	169	4
80	398	4
81	227	4
82	273	1
83	319	3
84	177	4
85	272	5
86	204	3
87	139	1
88	187	4
89	263	4
90	90	4
91	134	4
92	67	3
93	115	3
94	45	2
95	65	2
96	40	4
97	108	2
98	60	2
99	102	1

block	cartons	loadgroup
0	533	4
1	257	1
2	96	4
3	104	4
4	130	4
5	71	2
6	68	1
7	87	4
8	99	4
9	51	4
10	291	4
11	119	2
12	274	2
13	316	4
14	87	4
15	149	5
16	120	3
17	222	2
18	100	2
19	148	2
20	192	3
21	188	4
22	293	1
23	120	2
24	224	4
25	449	1
26	385	5
27	395	3
28	418	1
29	423	4
30	244	5
31	327	1
32	337	5
33	249	4
34	528	1
35	528	1
36	494	5
37	540	3
38	368	2
39	533	4
40	614	5
41	462	4
42	350	5
43	618	4
44	463	2
45	552	1
46	397	3
47	401	3
48	397	1
49	365	1
50	475	4
51	379	1
52	541	1
53	488	2
54	383	2
55	354	1
56	760	5
57	327	4
58	211	2
59	356	5
60	552	4
61	401	1
62	320	1
63	368	3
64	311	3
65	421	2
66	458	5
67	278	4
68	504	5
69	385	4
70	242	4
71	413	1
72	246	2
73	465	5
74	386	4
75	231	1
76	154	4
77	294	4
78	275	1
79	169	4
80	398	4
81	227	4
82	273	1
83	319	3
84	177	4
85	272	5
86	204	3
87	139	1
88	187	4
89	263	4
90	90	4
91	134	4
92	67	3
93	115	3
94	45	2
95	65	2
96	40	4
97	108	2
98	60	2
99	102	1

pandas df add new column based on proportion of two other columns from another dataframe

Answers (1)

Related Questions

block	cartons	loadgroup
0	533	4
1	257	1
2	96	4
3	104	4
4	130	4
5	71	2
6	68	1
7	87	4
8	99	4
9	51	4
10	291	4
11	119	2
12	274	2
13	316	4
14	87	4
15	149	5
16	120	3
17	222	2
18	100	2
19	148	2
20	192	3
21	188	4
22	293	1
23	120	2
24	224	4
25	449	1
26	385	5
27	395	3
28	418	1
29	423	4
30	244	5
31	327	1
32	337	5
33	249	4
34	528	1
35	528	1
36	494	5
37	540	3
38	368	2
39	533	4
40	614	5
41	462	4
42	350	5
43	618	4
44	463	2
45	552	1
46	397	3
47	401	3
48	397	1
49	365	1
50	475	4
51	379	1
52	541	1
53	488	2
54	383	2
55	354	1
56	760	5
57	327	4
58	211	2
59	356	5
60	552	4
61	401	1
62	320	1
63	368	3
64	311	3
65	421	2
66	458	5
67	278	4
68	504	5
69	385	4
70	242	4
71	413	1
72	246	2
73	465	5
74	386	4
75	231	1
76	154	4
77	294	4
78	275	1
79	169	4
80	398	4
81	227	4
82	273	1
83	319	3
84	177	4
85	272	5
86	204	3
87	139	1
88	187	4
89	263	4
90	90	4
91	134	4
92	67	3
93	115	3
94	45	2
95	65	2
96	40	4
97	108	2
98	60	2
99	102	1