Reputation: 513
Thanks to pandas, we could read a whole sheet into a data frame with the "read_excel" function.
I would like to use the same method using xlwings. In fact, my Workbook is already open and I don't want to use read_excel function (witch will take too long to execute by the way) but use the power of xlwings to save into a dataframe a whole sheet.
In fact with xlwings we could save a range into a dataframe. That mean I have to known the range size. But I guess there is a better (and quicker!) way to do that, isn't it?
Do you have some ideas to do that? Thanks a lot!
Edit: One exemple of one sheet I would like to transfer into a dataframe as read_excel would do it.
Name Point Time Power Test1 Test2 Test3 Test4 ##
Test 0 1 10 4 24 144
2 20 8 48 288
3 30 12 72 432
4 40 16 96 576
5 50 20 120 720
6 60 24 144 864
7 70 28 168 1008
8 80 32 192 1152
9 90 36 216 1296
10 100 40 240 1440
11 110 44 264 1584
12 120 48 288 1728
Upvotes: 9
Views: 32335
Reputation: 1
You can try my function as below.
import pandas as pd
import os
import xlwings as xw
# =============================================================================
# INSTRUCTIONS FOR read_pw_excel
# Function can also read in CSV files. But as the underlying software is Excel, if the CSV file exceeds the maximum number of rows allowed in Excel, only the 1st row up to the limit will be read in.
# 1) Leave sheet blank to read in all sheets in the Excel file.
# Output will be a vertically stacked df of all sheets. Otherwise, specifiy the exact sheet to read in for df.
# If reading in all sheets,
# Code will read in the 1st row of the 1st sheet as the header.
# Whether subsequent sheets have header row or do not have header row (ie all rows are data), code will use the 1st row of the 1st sheet as the header.
# Code will drop columns with None as header (including the index column if it doesn't have header)
# Code will convert all header to upper case
# 2) Leave excel_range blank to read in the used range in the sheet(s). Otherwise specify the exact range like 'A:B'.
# 3) Leave pw blank if Excel file is not password protected
# =============================================================================
def read_pw_excel(link, sheet='', excel_range = '', pw=''):
if not os.path.exists(link):
print("link does not exist. Please try again.")
raise Exception("LinkNotFoundError")
df_merged=pd.DataFrame()
col_names_first=''
col_names_next=''
lt_type=''
index_col=''
app = xw.App()
app.display_alerts=False
if pw=='':
#filebook = xw.Book(link)
filebook = app.books.open(link, update_links=False)
else: filebook = app.books.open(link, update_links=False, password=pw)
sht_ct = filebook.sheets.count
for ct in range(0, sht_ct):
if sheet=='':
if excel_range == '':
range_to_use = filebook.sheets[ct].used_range
else: range_to_use = excel_range
data = filebook.sheets[ct].range(range_to_use).value # Note this will be a list Eg [['Name', 'Age'], ['Tom', 4], ['Harry', 3]]. If there is only 1 column or 1 row, it will be ['Name', 'Tom', 'Harry'] or [ 'Tom', 'Tan', 6]
else:
if excel_range == '':
range_to_use = filebook.sheets[sheet].used_range
else: range_to_use = excel_range
data=filebook.sheets[sheet].range(range_to_use).value
if ct==0:
# if there is only 1 column, we convert header into a list: [['Name'], 'Tom', 'Harry'] - otherwise issue in converting to upper later when using [x.upper() if x is not None else x for x in data[0]]. Each letter of 'Name' will be separate element
if not isinstance(data[0], list):
data[0]=[data[0]]
lt_type='single_col'
pos_none= [pos for pos, x in enumerate(data[0]) if x == None] # data[0] is the header column. # when using used_range Python may read in some None columns, we find the position of these None columns
if 0 in pos_none: # If column has NONE header, it is probably an index column. We will need an indicator to later remove this index column
index_col='Y'
col_names_first_ls=data[0]
col_names_first_ls = [x for x in col_names_first_ls if x != None] # when using used_range Python may read in some None columns, we drop these
col_names_first= ''.join(str(x).replace(' ', '').upper() for x in col_names_first_ls)
else:
# if there is only 1 column, we convert header into a list: [['Name'], 'Tom', 'Harry'] - otherwise issue in converting to upper later when using [x.upper() if x is not None else x for x in data[0]]. Each letter of 'Name' will be separate element
if not isinstance(data[0], list) and lt_type=='single_col':
pos_none=[]
try:
data[0]=[data[0]]
except: #TypeError: 'str' object does not support item assignment. Occurs when data is a single str/number (ie, there is only 1 column with 1 row)
data=[[data]]
col_names_next_ls=data[0]
elif not isinstance(data[0], list) and lt_type!='single_col': # imply there is only 1 data row without header
col_names_next_ls=data
col_names_next_ls = [x for x in col_names_next_ls if x != None]
else: # imply there is > 1 row and >1 col
pos_none= [pos for pos, x in enumerate(data[0]) if x == None]
col_names_next_ls=data[0]
col_names_next_ls = [x for x in col_names_next_ls if x != None]
col_names_next= ''.join(str(x).replace(' ', '').upper() for x in col_names_next_ls)
# for 1st sheet and 2nd onwards sheets which have same header as 1st sheet
if (ct==0) or ((ct!=0) & (col_names_first==col_names_next)):
data[0]=[x for x in data[0] if x != None] # we remove any None header
# IF THERE IS NONE header column read in, we remove all the data (should all be None) for that column
if pos_none:
for pos, x in enumerate(data[1:]):
i=0
for n in pos_none:
n=n+i
x.pop(n)
i-=1
cols=[str(x).upper() if x is not None else x for x in data[0]] # this line will have issue if data[0] is a single string of the column name
# cols=[]
# if isinstance(data[0], list):
# for x in data[0]:
# if x is not None:
# cols.append(x.upper())
# else: cols.append(x)
# elif not isinstance(data[0], list): cols.append(data[0].upper())
df = pd.DataFrame(data[1:], columns = cols)
df = df[df.columns.dropna()] # drop columns with None as header. Some Excel file have >1 None columns, duplicate columns will have issue with concat
#df.columns = map(str.upper, df.columns) # May not work if header has None
df_merged=pd.concat([df_merged, df],axis=0, ignore_index=True)
elif (ct!=0) & (col_names_first!=col_names_next): # in case the 2nd onwards sheets don't have header (ie. 1st row is data), we need to add the header of the 1st sheet to it in order to concat
if lt_type=='single_col': data[0]=data[0][0] # if only 1 column, the 1st row data shan't be a list [['Tom'], 'Dick']
if len (col_names_next_ls) > len (col_names_first_ls): # Possibly some None header Column read in for ct!=. It could be an index column
# IF THERE IS an index column read in, we remove all the data for that column and don't include it
if index_col=='Y':
for x in data[0:]:
try:
x.pop(0)
except: #AttributeError: 'float'/ 'str' object has no attribute 'pop'. This occurs if it is a single row. And data is a single list (not an embedded list) Eg [10.0, 'tom_left', 20.0, 200.0, 31.0]
data.pop(0)
break
new_col_ls=col_names_first_ls
elif len (col_names_next_ls) < len (col_names_first_ls):
print("Number of column in Sheet " + str(ct) + " is fewer than in Sheet 0.")
raise Exception("Number of Columns in Sheets Differ Error")
elif len (col_names_next_ls) == len (col_names_first_ls):
new_col_ls=col_names_first_ls
cols=[x.upper() if x is not None else x for x in new_col_ls]
try:
df = pd.DataFrame(data[0:], columns = cols)
except: # when data[0:] is a single list with >1 col, sometimes will have ValueError: Shape of passed values is (x, 1), indices imply (x, x) Then will have to nest it in a list
df = pd.DataFrame([data[0:]], columns = cols)
df = df[df.columns.dropna()]
#df.columns = map(str.upper, df.columns)
df_merged=pd.concat([df_merged, df],axis=0, ignore_index=True)
if sheet!='': break
filebook.close()
app.quit()
return df_merged
Upvotes: 0
Reputation: 146
I made it myself. It is a function that converts data in the range of startCol and endCol into Pandas data frames by selecting a specific cell location in the search parameter. I'm using it classified, so I don't know if it's running well, but I hope it helps.
When I use it, I don't use execlFilePath, sheetName, startCol, endCol, but I use integrated cell_range to specify the range, but if I add these integrated functions, the code will be long, so I pull it out and upload it.
import re
import pandas as pd
import xlwings as xw
def GetDataFrame(execlFilePath, sheetName, startCol=None, endCol='Z', search='A1'):
r'''
execlFilePath : 엑셀 파일 경로
sheetName : 시트 이름
startCol : 시작 열
endCol : 끝 열
search : 특정 데이터 검색할 cell위치
'''
# search cell 행열 위치 분해
searchCOL, searchROW = re.match(r'([A-Z]+)(\d+)', search).groups()
# 시작열이 없을 경우
if not startCol:
startCol = searchCOL
workbook = xw.Book(execlFilePath)
# 시트 선택
worksheet: xw.main.Sheet = workbook.sheets[sheetName]
# 마지막 데이터 열번호
row = worksheet.range(search).end('down').row
# 데이터 가져오기
data = worksheet.range(f'{startCol}{searchROW}:{endCol}{row}').value
# 판다스 데이터프레임으로 변환
return pd.DataFrame(data[1:], columns=data[0])
Upvotes: 0
Reputation: 8473
xlwings does provide api to load whole sheet. To do that, use used_range
api which reads whole used part of the sheet. (Of course we don't want to get unused rows values, do we? ;-))
Anyway here is a snippet code on how to do it:
import pandas as pd
import xlwings as xw
workbook = xw.Book('some.xlsx')
sheet1 = workbook.sheets['sheet1'].used_range.value
df = pd.DataFrame(sheet1)
That's all.
Upvotes: 8
Reputation: 11
I spent more time reading a 20M Excel using pandas.read_excel. But xlwings reads Excel very quickly. I will consider reading with xlwings and converting to a Dataframe. I think I have the same needs as the title owner. Xlwings made some adjustments during the four years. So I made some changes to the code of the first answerer. `
import xlwings as xw
import pandas as pd
def GetDataFrame(wb_file,Sheets_i,N,M):
wb = xw.books(wb_file) #open your workbook
#Specify the value of the cell of the worksheet
Data=wb.sheets[Sheets_i].range((1,1),(N,M)).value
Data=pd.DataFrame(Data)
Data=Data.dropna(how='all',axis=1)
Data=Data.dropna(how='all',axis=0)
return Data
`
Upvotes: 1
Reputation: 1289
You can use built-in converters to bring it in one line:
df = sht.range('A1').options(pd.DataFrame,
header=1,
index=False,
expand='table').value
Upvotes: 22
Reputation: 513
In fact, I could do something like that :
import xlwings as xw
import pandas as pd
def GetDataFrame(Sheet,N,M):
wb = xw.Workbook.active()
Data=xw.Range(Sheet,(1,1),(N,M)).value
Data=pd.DataFrame(Data)
Data=Data.dropna(how='all',axis=1)
Data=Data.dropna(how='all',axis=0)
return Data
Upvotes: 4
Reputation: 85612
You can read from multiple sheets with pandas:
excel_file = pd.ExcelFile('myfile.xls')
df1 = excel_file.parse('Sheet1')
df2 = excel_file.parse('Sheet2')
So, just open one file after the other, read from the sheets you want and process the data frames.
Upvotes: 4