f0rty
f0rty

Reputation: 15

How do you compare multiple columns in Excel using Python?

I've got an excel sheet that has values such as:

ProductID Reference Component used DESCRIPTION
5325646 PD-42#23 ASDGE34#W-AL HOUSING PLATE
4563756 ITA-COPPER RAW AL WASHER
4345432 ASDGE34#W-AL ZK23456 3MM X 2MM X 10MM COPPER BAR
1245456 ASDGE34#W-AL ZK66634 1MM X 2MM X 9MM COPPER BAR

What I essentially want to do is match anything in "Component" with "Reference" and if they come true, like in this example, ASDGE34#W-AL matches in "Component" and in "Reference", so I'd want to copy the details over and shift it after "Description". The end result that I want to achieve is this.

ProductID Reference Component used DESCRIPTION Child Component ID DESCRIPTION
5325646 PD-42#23 ASDGE34#W-AL HOUSING PLATE ZK23456 & ZK654534 3MM X 2MM X 10MM COPPER BAR &1MM X 2MM X 9MM COPPER BAR
4563756 ITA-COPPER ZAAGD0545 ALUMINIUM BAR
5656565 0788866-7#D YN435611 DUSTER

Can anyone help with this? I've been using Python to try and iterate a loop over the "Components used" and the "Reference" column seeing what match but no luck so far.

Upvotes: 0

Views: 1112

Answers (1)

GordonAitchJay
GordonAitchJay

Reputation: 4860

I have code that does something similar. I've changed it to suit your problem.

The products are stored in 2 separate dict lists, accessed by Reference and Component used. You can easily iterate over the products_by_reference dict using a Component used value to get a list of products with a matching reference.

Note that this implementation assumes each column header is unique, as each row is stored as a dict where each column header is used as the key to access the corresponding value, so if a header is used twice, the first value will be overwritten. Because of this, I manually rename the header Child DESCRIPTION to DESCRIPTION after writing the header columns.

import os
from collections import defaultdict
import openpyxl

# Load the existing spreadsheet.
wb = openpyxl.load_workbook(filename="Products.xlsx")
ws = wb.worksheets[0]

# Style used for the header row
bold_font = openpyxl.styles.Font(bold=True)
grey_fill = openpyxl.styles.fills.PatternFill("solid", fgColor="F8F9F9")

# Store the header column names and numbers.
# Key = column name, value = column number. e.g. "ProductID": 1
header_columns = {}
for column_num in range(1, ws.max_column + 1):
    column_name = ws.cell(row=1, column=column_num).value
    header_columns[column_name] = column_num


# Read in all the values, starting after the header row.
# Store each product as a dict, and store every product in 2 separate 
# dict of lists, with the keys being the reference and component used.
products_by_reference = defaultdict(list)
products_by_component = defaultdict(list)
for row_num in range(2, ws.max_row + 1):
    product = {}
    for column_name, column_num in header_columns.items():
        product[column_name] = ws.cell(row=row_num, column=column_num).value
    
    reference = product["Reference"]
    component = product["Component used"]
    products_by_reference[reference].append(product)
    products_by_component[component].append(product)

# Iterate over the products by component, and copy and join all the 
# components used and descriptions to the relevant product's dict.
for component, products in products_by_component.items():
    # Use dicts here instead of sets if preserving the order is important.
    child_components = set()
    child_descriptions = set()
    for product in products_by_reference[component]:
        child_components.add(product["Component used"])
        child_descriptions.add(product["DESCRIPTION"])
    
    child_components_string = " & ".join(child_components)
    child_descriptions_string = " & ".join(child_descriptions)
    for product in products:
        product["Child Component ID"] = child_components_string
        product["Child DESCRIPTION"] = child_descriptions_string
    
    # products_by_reference will be used later for the new spreadsheet, and 
    # we don't want to include any products that have a reference which is a 
    # component used by another product, since they have now been merged with 
    # that other product.
    del products_by_reference[component]

# Write the products to a new spreadsheet.
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Products"

# Freeze the header row.
ws.freeze_panes = "A2"

# Add the extra headers.
column_num = len(header_columns)
header_columns["Child Component ID"] = column_num + 1
header_columns["Child DESCRIPTION"] = column_num + 2

# Write the headers.
for column_num, column_name in enumerate(header_columns, start=1):
    cell = ws.cell(row=1, column=column_num, value=column_name)
    cell.font = bold_font
    cell.fill = grey_fill

# This implementation doesn't allow there to be duplicate headers, as the 
# headers are used as keys to retrieve values from the product dicts. So the 
# the last header is manually changed from "Child DESCRIPTION" to 
# "DESCRIPTION" after it has been written to the sheet.
cell.value = "DESCRIPTION"

# Flatten the dictionary of lists, and write every product to the spreadsheet.
products = [product for products_list in products_by_reference.values() for product in products_list]
for row_num, product in enumerate(products, start=2):
    for column_name, column_num in header_columns.items():
        cell = ws.cell(row=row_num, column=column_num, value=product.get(column_name))

# Save the spreadsheet to file, then launch it.
filename = "Merged products.xlsx"
wb.save(filename)
os.startfile(filename)

Upvotes: 1

Related Questions