Reputation: 21
My 'dataframe' contain two columns, the first one is the SKU number and the second one is the part number for each SKU number. Some SKUs share same part numbers, how do I find these SKUs that share part numbers?
import pandas as pd
table_teste = pd.read_csv("table.csv")
print(table_teste)[see in the picture attached here the screenshot of the input vales][1]
Output:
SKU Part Number
0 4679343 126420
1 4679343 489136
2 4679343 490202
3 4679343 490282
4 4679343 491971
5 4679343 492963
6 4679343 626681
7 4679343 627996
8 4679343 628361
9 4679343 628379
10 4679343 628379
11 4679343 628408
12 4679343 628531
13 4679343 1105601
14 4679343 1140073
15 4679343 2169104
16 4679343 2169104
17 4679343 2169142
18 4679343 2185762
19 4679343 2194712
20 4679343 2195058
21 4679343 2256086
22 4679343 2315522
23 4679343 2315522
24 4679343 2319835
25 4679343 8314101
26 4679343 8314102
27 4679343 8314229
28 4679343 8314231
29 4679343 8314232
... ... ...
73953 WRO80CKDWA W11234774
73954 WRO80CKDWA W11239503
73955 WRO80CKDWA W11240332
73956 WRO80CKDWA W11240358
73957 WRO80CKDWA W11240361
73958 WRO80CKDWA W11240362
73959 WRO80CKDWA W11240363
73960 WRO80CKDWA W11282632
73961 WRO80CKDWA W11282632
73962 WRO80CKDWA W11293453
73963 WRO80CKDWA W11294381
73964 WRO80CKDWA W11294503
73965 WRO80CKDWA W11298984
73966 WRO80CKDWA W11308860
73967 WRO80CKDWA W11308879
73968 WRO80CKDWA W11314128
73969 WRO80CKDWA W11317776
73970 WRO80CKDWA W11323281
73971 WRO80CKDWA W11323282
73972 WRO80CKDWA W11323283
73973 WRO80CKDWA W11323284
73974 WRO80CKDWA W11366199
73975 WRO80CKDWA W11366205
73976 WRO80CKDWA W11366209
73977 WRO80CKDWA W11366214
73978 WRO80CKDWA W11366215
73979 WRO80CKDWA W11370412
73980 WRO80CKDWA W11370419
73981 WRO80CKDWA W11370494
73982 WRO80CKDWA ZCOMP_FREIGHT
Now I need to generate a matrix that has the SKU numbers in the row and the same SKU numbers in the column and in the matrix the count of how many part numbers the combination of SKU number 1 and SKU number 2 share that are the same. Same thing for SKU number 2 with SKU number 3 and so on. There is 182 SKU numbers in total.
Thank you
Upvotes: 2
Views: 818
Reputation: 83
You can try using groupby, convert the groups to list and reset index.
# dictionary of sku number as key and value as part number
# I'm assuming this is how the df might look like
d = {1: 2, 2: 3, 3: 2, 4: 2, 5: 3, 6: 2, 7: 3}
# making a dataframe out of the dict to resemble df in que
df = pd.DataFrame(d.items(), columns=['SKU Number', 'Part Number'])
df
Output:
SKU Number Part Number
0 1 2
1 2 3
2 3 2
3 4 2
4 5 3
5 6 2
6 7 3
# first groupby part numbers
g = df.groupby('Part Number')
# convert groups to list of two-SKU combinations and then reset index to create a new data
x = g['SKU Number'].apply(itertools.combinations(x, 2))).reset_index(name='SKU numbers')
x
Output:
Part Number SKU numbers
0 2 [(1, 3), (1, 4), (1, 6), (3, 4), (3, 6), (4, 6)]
1 3 [(2, 5), (2, 7), (5, 7)]
^ now we have all two-member SKU combinations for each part number. Let's explode the list in SKU numbers column.
x = x.explode('SKU numbers')
x
out:
Part Number SKU numbers
0 2 (1, 3)
0 2 (1, 4)
0 2 (1, 6)
0 2 (3, 4)
0 2 (3, 6)
0 2 (4, 6)
1 3 (2, 5)
1 3 (2, 7)
1 3 (5, 7)
Now we need to group pairs of SKU numbers and count the part numbers associated with them
x = x.groupby('SKU numbers').count().reset_index()
x
out:
SKU numbers Part Number
0 (1, 3) 1
1 (1, 4) 1
2 (1, 6) 1
3 (2, 5) 1
4 (2, 7) 1
5 (3, 4) 1
6 (3, 6) 1
7 (4, 6) 1
8 (5, 7) 1
^ Now we have counts for part numbers for each SKU pair. Let's construct the matrix.
import numpy as np
indexes = x['SKU numbers'].values
part_number_counts = x['Part Number'].values
# in my small case, we have 7 unique SKUs
unique_SKUs = 7
# creating a zero matrix so that we can populate part num counts
# for each SKU pair
a = np.zeros((unique_SKUs, unique_SKUs))
a
out:
array([[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.]])
# split [(x1, y1), (x2, y2) ...] to rows -> [x1, x2 ...]
# columns -> [y1, y2 ....]
rows, columns = map(np.array , zip(*indexes))
# rows-1, columns-1 are done to make index 0-based
a[rows-1, columns-1] = part_number_counts
a
out:
array([[0., 0., 1., 1., 0., 1., 0.],
[0., 0., 0., 0., 1., 0., 1.],
[0., 0., 0., 1., 0., 1., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.]])
For the final part, I'm using my indexes(SKU pairs), converting them to 0-based indexes and updating their corresponding part_number_counts to the zero matrix to get the resultant matrix.
The resultant matrix will have shape of (unique_SKU_numbers, unique_SKU_numbers) and the value i,j would correspond to the part_number_counts
Upvotes: 0
Reputation: 789
Find all part numbers with more than 1 SKU:
partNumber_w_dupSKU = data %>%
group_by(partNumber) %>%
summarize(n_SKU = n_distinct(SKU)) %>%
ungroup() %>%
filter(n_SKU > 1)
Find all SKU's associated with these part numbers:
data %>%
arrange(SKU) %>%
filter(partNumber %in% partNumber_w_dupSKU$partNumber)
Upvotes: 1
Reputation: 9
You can use a groupby() on part numbers which will group your dataframe accordingly If you do groupby on SKU numbers then it will show a dataframe with SKU number that share a common part number Or vice versa
Upvotes: 0