Reputation: 11
I am extracting excel sheet data using python's openpyxl library .I can easily extract data but the problem is i want the 'insert table ' data which is present in special tables to be extracted separately and the other non table cell data to be extracted separately (means it ignores the table cells and only extracts the cell data)
I can read tables python using openpyxl
code :
import pandas as pd
from openpyxl import load_workbook
# Load the workbook
filename = "/Users/snikesh/Desktop/untitled folder/Input.xlsx"
wb = load_workbook(filename)
# Iterate over all sheets in the workbook
for sheet_name in wb.sheetnames:
ws = wb[sheet_name] # Get the sheet by name
# Check if there are tables on the sheet
if ws.tables:
# Iterate over the tables and print them as dataframes
for table in ws.tables.values():
data = ws[table.ref] # Get the range covered by the table
rows = [[cell.value for cell in row] for row in data] # Extract the data into rows of cell values
df = pd.DataFrame(rows) # Create a DataFrame, using the first row for column headers
print(df)
else:
print("No tables found on this sheet.")
and the cell data(along with table data ) is extracted by simple python code :
import pandas as pd
from openpyxl import load_workbook
# Load the workbook
workbook_path = '/path/to/your/excel/file.xlsx'
workbook = load_workbook(workbook_path)
# Dictionary to hold dataframes for each sheet
dataframes = {}
# Loop through each sheet in the workbook
for sheet_name in workbook.sheetnames:
worksheet = workbook[sheet_name]
data = []
# Extract data from each row in the sheet
for row in worksheet.iter_rows(values_only=True):
data.append(row)
# Create a DataFrame and add it to the dictionary
# Assuming the first row is the header
df = pd.DataFrame(data)
dataframes[sheet_name] = df
# Access a specific DataFrame
sheet_name_to_access = 'Sheet1' # Example sheet name
print(dataframes[sheet_name_to_access]) # Print the DataFrame for 'Sheet1'
The problem is i want to to print tables separately and not with the normal data.
I am not able to modify the code such that it ignores the tables cells and print only non table cell data separately .
If i am able to do so i can extract table data using the first code.
Upvotes: 1
Views: 47
Reputation: 6630
In that you know what rows the Tables reside in you can load the sheet in to a dataframe using pandas read_excel
skipping the rows that the Tables are in.
For example if the the Table is 'A7:J14'
, extract the rows, add necessary padding for header(s) and skip those rows
for sheet_name in workbook.sheetnames:
df = pd.read_excel("tables.xlsx",
sheet_name=sheet_name ,
skiprows=range(6, 14),
)
or for multiple range selections
table1_rng = list(range(6, 14))
table2_rng = list(range(18, 25))
table3_rng = list(range(31, 37))
for sheet_name in workbook.sheetnames:
df = pd.read_excel("tables.xlsx",
sheet_name=sheet_name,
skiprows=table1_rng+table2_rng+table3_rng
)
Upvotes: 1