Reputation: 793
I have an Excel sheet that has multiple tabs, and each individual tab has multiple tables in it. I want to read the file in such a way that it reads each table from each tab of the sheet. For instance:
Tab1 has five tables in it.
Tab2 has ten tables in it.
.....
.....
I want to read each of these tables into a pandas DataFrame and then save it to a SQL database. I already know how to read multiple tabs from the Excel sheet.
Can anyone help me out here or point me in a direction where I can find a lead?
The tables in each tab are pre-defined and have names. This is how it looks in each tab (screenshot: "Tab from excel sheet").
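(For context, reading every tab at once and writing a frame to SQL can look roughly like the sketch below; the SQLite connection string and file name are placeholders.)
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string and file name
engine = create_engine("sqlite:///tables.db")

# sheet_name=None returns a dict: {tab name: DataFrame of the whole tab}
all_tabs = pd.read_excel("test.xlsx", sheet_name=None, header=None)

for tab_name, frame in all_tabs.items():
    # each frame still holds every table of that tab stacked together,
    # which is exactly the part I need help splitting apart
    frame.to_sql(tab_name, engine, if_exists="replace", index=False)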
Upvotes: 1
Views: 1216
Reputation: 28644
You will probably have to tweak this to match your data; imagine, for instance, that some tables sit below others on the same sheet. Hopefully it points you in the right direction. Also, note the number of for loops I used; I believe you can do better and optimize it further.
import pandas as pd
from openpyxl import load_workbook
from collections import defaultdict
from itertools import product, groupby
from operator import itemgetter

wb = load_workbook(filename="test.xlsx")
sheet = wb["Sheet1"]

green_rows = defaultdict(list)
rest_data = []
for row in sheet:
    for cell in row:
        # look for the green rows; they contain the headers
        if cell.fill.fgColor.rgb == "FFA2D722":
            # take advantage of the fact that the header
            # is the first entry in that row
            if cell.value:
                val = cell.value
            green_rows[(val, cell.row)].append(cell.column)
        else:
            if cell.value not in (None, ""):  # so the 0s are not lost
                rest_data.append((cell.row, cell.column, cell.value))
# get the max and minimum column positions
# note the addition of 1 to the max,
# this is necessary when iterating to sort the data
# in the next section
green_rows = [
(name, row, range(min(value), max(value) + 1))
for (name, row), value in green_rows.items()
]
box = []
# here the green rows and the rest of the data
# are combined, then filtered for the respective
# sections
combo = product(green_rows, rest_data)
for (header, header_row, header_column_range), (
    cell_row,
    cell_column,
    cell_value,
) in combo:
    # this is where the filtration occurs
    if (header_row < cell_row) and (cell_column in header_column_range):
        box.append((header, cell_row, cell_column, cell_value))
final = defaultdict(list)
content = groupby(box, itemgetter(1, 0))
# another iteration to get the final result
for key, value in content:
    final[key[-1]].append([val[-1] for val in value])
You can then create a DataFrame for each of the headers:
pd.DataFrame(final["Address Association"])
    0                         1                   2                             3           4                         5
 0  Column Name in DB         Name                Description                   SortOrder   BusinessMeaningName       Obsolete
 1  Field Type                nvarchar(100)       nvarchar(255)                 int         nvarchar(50)              bit
 2  Mandatory                 Yes                 Yes                           Yes         No                        Yes
 3  Foreign Key               -                   -                             -           -                         -
 4  Optional Feature          -                   -                             -           -                         -
 5  Field Name in U4SM        Name                Description                   Sort Order  Business Meaning Name     Obsolete
 6  Address.Primary           Primary             Use this address by default.  1           Address.Primary           0
 7  Address.Billing           Billing             address for billing.          2           Address.Billing           0
 8  Address.Emergency         Emergency           use this for emergency.       3           Address.Emergency         0
 9  Address.Emergency SMS     Emergency SMS       use this for emergency SMS.   4           Address.Emergency SMS     0
10  Address.Deceased          Deceased            address for deceased.         5           Address.Deceased          0
11  Address.Home              Home                address for home.             8           Address.Home              0
12  Address.Mailing           Mailing             address for mailing.          9           Address.Mailing           0
13  Address.Mobile            Mobile              use this for mobile.          10          Address.Mobile            0
14  Address.School            School              address for school.           13          Address.School            0
15  Address.SMS               SMS                 use this for SMS text.        15          Address.SMS               0
16  Address.Work              Work                address for work              16          Address.Work              0
17  Address.Permanent         Permanent           Permanent Address             17          Address.Permanent         0
18  Address.HallsOfResidence  Halls of Residence  Halls of Residence            18          Address.HallsOfResidence  0
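To finish the save-to-SQL part of the question, each recovered block can be written out with DataFrame.to_sql, and the whole parsing section can be repeated for every tab by looping over wb.sheetnames. A rough sketch, assuming SQLAlchemy and a placeholder SQLite database:
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string; swap in your own database URL
engine = create_engine("sqlite:///tables.db")

# `final` is the dict built above (header name -> list of rows) for one sheet;
# wrap the parsing code in a loop over wb.sheetnames to cover every tab
for header, rows in final.items():
    df = pd.DataFrame(rows)
    # depending on your layout you may want to promote one of the first
    # rows to column names before writing
    df.to_sql(header, engine, if_exists="replace", index=False)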
Upvotes: 1