Mathieu Moquin
Mathieu Moquin

Reputation: 37

How can I concatenate multiple rows of excel data into one?

I'm currently facing an issue where I need to bring all of the data shown in the images below into one line only.

enter image description here

So using Python and Openpyxl, I tried to write a parsing script that reads the line and only copies when values are non-null or non-identical, into a new workbook.

I get out of range errors, and the code does not keep just the data I want. I've spent multiple hours on it, so I thought I would ask here to see if I can get unstuck.

I've read some documentation on Openpyxl and about making lists in python, tried a couple of videos on youtube, but none of them did exactly what I was trying to achieve.

import openpyxl
from openpyxl import Workbook

path = "sample.xlsx"
wb = openpyxl.load_workbook(path)
ws = wb.active
path2 = "output.xlsx"
wb2 = Workbook()
ws2 = wb2.active

listab = []
rows = ws.max_row
columns = ws.max_column

for i in range (1, rows+1):
   listab.append([])

cellValue = " "    
prevCell = " "

for c in range (1, rows+1):
    for r in range(1, columns+1):
        cellValue = ws.cell(row=r, column=c).value
            if cellValue == prevCell:
                listab[r-1].append(prevCell)
            elif cellValue == "NULL":
                listab[r-1].append(prevCell)
            elif cellValue != prevCell:
                listab[r-1].append(cellValue)
                prevCell = cellValue           

for r in range(1, rows+1):
    for c in range (1, columns+1):
        j = ws2.cell(row = r, column=c)
        j.value = listab[r-1][c-1]

print(listab)

wb2.save("output.xlsx")

There should be one line with the below information:

ods_service_id | service_name| service_plan_name| CPU | RAM | NIC | DRIVE |

Upvotes: 3

Views: 3137

Answers (2)

Charlie Clark
Charlie Clark

Reputation: 19537

To be honest, I think you've managed to get confused by data structures and come up with something far more complicated than you need.

One approach that would suit would be to use Python dictionaries for each service, updating them row by row.

wb = load_workbook("sample.xlsx")
ws = wb.active
objs = {}
headers = next(ws.iter_rows(min_row=1, max_row=1, values_only=True))
for row in ws.iter_rows(min_row=2, values_only=True):
    if row[0] not in objs:
        obj = {key:value for key, value in zip(headers, row)}
        objs[obj['ods_service_id']] = obj
    else:# update dict with non-None values
        extra = {key:value for key, value in zip(headers[3:], row[3:]) if value != "NULL"}
        obj.update(extra)

# write to new workbook
wb2 = Workbook()
ws2 = wb2.active
ws2.append(headers)
for row in objs.values(): # do they need sorting?
    ws2.append([obj[key] for key in headers])

Note how you can do everything without using counters.

Upvotes: 1

Sebastien D
Sebastien D

Reputation: 4482

Personally I would go with pandas.

import pandas as pd

#Loading into pandas
df_data = pd.read_excel('sample.xlsx')
df_data.fillna("NO DATA",inplace=True)  ## Replaced nan values with "NO DATA"
unique_ids = df_data.ods_service_ids.unique()

#Storing pd into a list
records_list = df_data.to_dict('records') 
keys_to_check = ['service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']
processed = {}

#Go through unique ids
for key in unique_ids:
    processed[key] = {}

    #Get related records
    matching_records = [y for y in records_list if y['ods_service_ids'] == key]
    #Loop through records
    for record in matching_records:
        #For each key to check, save in dict if non null
        processed[key]['ods_service_ids'] = key
        for detail_key in keys_to_check:
            if record[detail_key] != "NO DATA" :
                processed[key][detail_key] = record[detail_key]
        ##Note : doesn't handle duplicate values for different keys so far


#Records are put back in list
output_data = [processed[x] for x in processed.keys()]
# -> to Pandas
df = pd.DataFrame(output_data)[['ods_service_ids','service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']]

#Export to Excel
df.to_excel("output.xlsx",sheet_name='Sheet_name_1', index=False)

The above should work but I wasn't really sure on how you wanted to save duplicated records for the same id. Do you look to store them as DRIVE_0, DRIVE_1, DRIVE_2 ?

EDIT:

df could be exported in a different way. Replaced below #export to Excel with the following :

df.to_excel("output.xlsx",sheet_name='Sheet_name_1')

EDIT 2:

with no input data it was hard to see any flows. Corrected the code above with fake data

Upvotes: 2

Related Questions