Reputation: 513

A whole sheet into a pandas dataframe with xlwings

Thanks to pandas, we could read a whole sheet into a data frame with the "read_excel" function.

I would like to use the same method using xlwings. In fact, my Workbook is already open and I don't want to use read_excel function (witch will take too long to execute by the way) but use the power of xlwings to save into a dataframe a whole sheet.

In fact with xlwings we could save a range into a dataframe. That mean I have to known the range size. But I guess there is a better (and quicker!) way to do that, isn't it?

Do you have some ideas to do that? Thanks a lot!

Edit: One exemple of one sheet I would like to transfer into a dataframe as read_excel would do it.

Name Point  Time    Power   Test1   Test2   Test3   Test4 ##
Test    0   1   10  4   24  144
        2   20  8   48  288
        3   30  12  72  432
        4   40  16  96  576
        5   50  20  120 720
        6   60  24  144 864
        7   70  28  168 1008
        8   80  32  192 1152
        9   90  36  216 1296
        10  100 40  240 1440
        11  110 44  264 1584
        12  120 48  288 1728

Upvotes: 9

Answers (7)

RedHairedGirl

Reputation: 1

You can try my function as below.

import pandas as pd
import os
import xlwings as xw


# =============================================================================

# INSTRUCTIONS FOR read_pw_excel

# Function can also read in CSV files. But as the underlying software is Excel, if the CSV file exceeds the maximum number of rows allowed in Excel, only the 1st row up to the limit will be read in.

 

# 1) Leave sheet blank to read in all sheets in the Excel file.

# Output will be a vertically stacked df of all sheets. Otherwise, specifiy the exact sheet to read in for df.

# If reading in all sheets,

   # Code will read in the 1st row of the 1st sheet as the header.

   # Whether subsequent sheets have header row or do not have header row (ie all rows are data), code will use the 1st row of the 1st sheet as the header.

# Code will drop columns with None as header (including the index column if it doesn't have header)

# Code will convert all header to upper case

 

# 2) Leave excel_range blank to read in the used range in the sheet(s). Otherwise specify the exact range like 'A:B'.

# 3) Leave pw blank if Excel file is not password protected

# =============================================================================

 

def read_pw_excel(link, sheet='', excel_range = '', pw=''):

 

    if not os.path.exists(link):

        print("link does not exist. Please try again.")

        raise Exception("LinkNotFoundError")

       

    df_merged=pd.DataFrame()

    col_names_first=''

    col_names_next=''

    lt_type=''

    index_col=''

    app = xw.App()

    app.display_alerts=False

   

    if pw=='':

        #filebook = xw.Book(link)

        filebook = app.books.open(link, update_links=False)

    else: filebook = app.books.open(link, update_links=False, password=pw)

   

    

    sht_ct = filebook.sheets.count

 

    for ct in range(0, sht_ct):

 

        if sheet=='':         

            if excel_range == '':

                range_to_use = filebook.sheets[ct].used_range  

            else: range_to_use = excel_range

       

            data = filebook.sheets[ct].range(range_to_use).value # Note this will be a list Eg [['Name', 'Age'], ['Tom', 4], ['Harry', 3]]. If there is only 1 column or 1 row, it will be ['Name', 'Tom', 'Harry'] or [ 'Tom', 'Tan', 6]

           

        else:

            if excel_range == '':

                range_to_use = filebook.sheets[sheet].used_range            

            else: range_to_use = excel_range

       

            data=filebook.sheets[sheet].range(range_to_use).value

 

 

 

        if ct==0:

           

            # if there is only 1 column, we convert header into a list: [['Name'], 'Tom', 'Harry'] - otherwise issue in converting to upper later when using [x.upper() if x is not None else x for x in data[0]]. Each letter of 'Name' will be separate element

            if not isinstance(data[0], list):

                data[0]=[data[0]]

                lt_type='single_col'

       

            

            pos_none= [pos for pos, x in enumerate(data[0]) if x == None] # data[0] is the header column. # when using used_range Python may read in some None columns, we find the position of these None columns

            if 0 in pos_none: # If column has NONE header, it is probably an index column. We will need an indicator to later remove this index column

                index_col='Y'

               

                

            col_names_first_ls=data[0]

            col_names_first_ls = [x for x in col_names_first_ls if x != None] # when using used_range Python may read in some None columns, we drop these

            col_names_first= ''.join(str(x).replace(' ', '').upper() for x in col_names_first_ls)

            

        else:

            

            # if there is only 1 column, we convert header into a list: [['Name'], 'Tom', 'Harry'] - otherwise issue in converting to upper later when using [x.upper() if x is not None else x for x in data[0]]. Each letter of 'Name' will be separate element

            if not isinstance(data[0], list) and lt_type=='single_col':

                pos_none=[]

              

                try:

                    data[0]=[data[0]]

                except: #TypeError: 'str' object does not support item assignment. Occurs when data is a single str/number (ie, there is only 1 column with 1 row)

                    data=[[data]]

                   

                col_names_next_ls=data[0]

               

                

                

            elif not isinstance(data[0], list) and lt_type!='single_col':  # imply there is only 1 data row without header

                col_names_next_ls=data

                col_names_next_ls = [x for x in col_names_next_ls if x != None]

           

            else: # imply there is > 1 row and >1 col

                pos_none= [pos for pos, x in enumerate(data[0]) if x == None]

                

                col_names_next_ls=data[0]

                col_names_next_ls = [x for x in col_names_next_ls if x != None]

               

            

            col_names_next= ''.join(str(x).replace(' ', '').upper() for x in col_names_next_ls)

           

       

        

       

        # for 1st sheet and 2nd onwards sheets which have same  header as 1st sheet

        if (ct==0) or ((ct!=0) & (col_names_first==col_names_next)):

            

            data[0]=[x for x in data[0] if x != None] # we remove any None header

 

            # IF THERE IS NONE header column read in, we remove all the data (should all be None) for that column

            if pos_none:

                for pos, x in enumerate(data[1:]):

                    i=0

                    for n in pos_none:

                        n=n+i

                        x.pop(n)

                        i-=1

 

            cols=[str(x).upper() if x is not None else x for x in data[0]] # this line will have issue if data[0]  is  a single string of the column name

            # cols=[]

            # if isinstance(data[0], list):

            #     for x in data[0]:

            #         if x is not None:

            #             cols.append(x.upper())

            #         else: cols.append(x)

           

            # elif not isinstance(data[0], list): cols.append(data[0].upper())

               

            df = pd.DataFrame(data[1:], columns = cols)

            df = df[df.columns.dropna()]  # drop columns with None as header. Some Excel file have >1 None columns, duplicate columns will have issue with concat

            #df.columns = map(str.upper, df.columns) # May not work if header has None 

            df_merged=pd.concat([df_merged, df],axis=0, ignore_index=True)

           

        elif (ct!=0) & (col_names_first!=col_names_next): # in case the 2nd onwards sheets don't have header (ie. 1st row is data), we need to add the header of the 1st sheet to it in order to concat

            if lt_type=='single_col': data[0]=data[0][0] # if only 1 column, the 1st row data shan't be a list [['Tom'], 'Dick']

               

            if len (col_names_next_ls) > len (col_names_first_ls): # Possibly some None header Column read in for ct!=. It could be an index column

                # IF THERE IS an index column read in, we remove all the data for that column and don't include it

       

                if index_col=='Y':

                    for x in data[0:]:

                        try:

                            x.pop(0)

                        except:  #AttributeError: 'float'/ 'str' object has no attribute 'pop'. This occurs if it is a single row. And data is a single list (not an embedded list) Eg [10.0, 'tom_left', 20.0, 200.0, 31.0]

                            data.pop(0)

                            break

                       

                new_col_ls=col_names_first_ls

 

                      

            elif len (col_names_next_ls) < len (col_names_first_ls):

                print("Number of column in Sheet " + str(ct) + " is fewer than in Sheet 0.")

                raise Exception("Number of Columns in Sheets Differ Error")

              

               

            elif len (col_names_next_ls) == len (col_names_first_ls):

                 new_col_ls=col_names_first_ls

             

        

            cols=[x.upper() if x is not None else x for x in new_col_ls]

           

            try:

                df = pd.DataFrame(data[0:], columns = cols)

 

            except: # when data[0:] is a single list with >1 col, sometimes will have ValueError: Shape of passed values is (x, 1), indices imply (x, x)  Then will have to nest it in a list

                df = pd.DataFrame([data[0:]], columns = cols)

          

            df = df[df.columns.dropna()]

            #df.columns = map(str.upper, df.columns)

            df_merged=pd.concat([df_merged, df],axis=0, ignore_index=True)

        

                

                

            

            

        if sheet!='': break

 

    filebook.close()

    app.quit()

   

    return df_merged

Upvotes: 0

Hong승원

Reputation: 146

I made it myself. It is a function that converts data in the range of startCol and endCol into Pandas data frames by selecting a specific cell location in the search parameter. I'm using it classified, so I don't know if it's running well, but I hope it helps.

When I use it, I don't use execlFilePath, sheetName, startCol, endCol, but I use integrated cell_range to specify the range, but if I add these integrated functions, the code will be long, so I pull it out and upload it.

import re
import pandas as pd
import xlwings as xw

def GetDataFrame(execlFilePath, sheetName, startCol=None, endCol='Z', search='A1'):
    r'''
    execlFilePath : 엑셀 파일 경로
    sheetName : 시트 이름
    startCol : 시작 열
    endCol : 끝 열
    search : 특정 데이터 검색할 cell위치
    '''
    # search cell 행열 위치 분해
    searchCOL, searchROW = re.match(r'([A-Z]+)(\d+)', search).groups()

    # 시작열이 없을 경우
    if not startCol:
        startCol = searchCOL

    workbook = xw.Book(execlFilePath)
    # 시트 선택
    worksheet: xw.main.Sheet = workbook.sheets[sheetName]
    # 마지막 데이터 열번호
    row = worksheet.range(search).end('down').row
    # 데이터 가져오기
    data = worksheet.range(f'{startCol}{searchROW}:{endCol}{row}').value
    # 판다스 데이터프레임으로 변환
    return pd.DataFrame(data[1:], columns=data[0])

Upvotes: 0

TheLittleNaruto

Reputation: 8473

xlwings does provide api to load whole sheet. To do that, use used_range api which reads whole used part of the sheet. (Of course we don't want to get unused rows values, do we? ;-)) Anyway here is a snippet code on how to do it:

import pandas as pd
import xlwings as xw

workbook = xw.Book('some.xlsx')
sheet1 = workbook.sheets['sheet1'].used_range.value
df = pd.DataFrame(sheet1)

That's all.

Upvotes: 8

李田鑫

Reputation: 11

I spent more time reading a 20M Excel using pandas.read_excel. But xlwings reads Excel very quickly. I will consider reading with xlwings and converting to a Dataframe. I think I have the same needs as the title owner. Xlwings made some adjustments during the four years. So I made some changes to the code of the first answerer. `

import xlwings as xw
import pandas as pd

def GetDataFrame(wb_file,Sheets_i,N,M):
    wb = xw.books(wb_file)   #open your workbook
         #Specify the value of the cell of the worksheet
    Data=wb.sheets[Sheets_i].range((1,1),(N,M)).value  
    Data=pd.DataFrame(Data)
    Data=Data.dropna(how='all',axis=1)
    Data=Data.dropna(how='all',axis=0)
    return Data

Upvotes: 1

kateryna

Reputation: 1289

You can use built-in converters to bring it in one line:

df = sht.range('A1').options(pd.DataFrame, 
                             header=1,
                             index=False, 
                             expand='table').value

Upvotes: 22

Coolpix

Reputation: 513

In fact, I could do something like that :

import xlwings as xw
import pandas as pd

def GetDataFrame(Sheet,N,M):
    wb = xw.Workbook.active()
    Data=xw.Range(Sheet,(1,1),(N,M)).value
    Data=pd.DataFrame(Data)
    Data=Data.dropna(how='all',axis=1)
    Data=Data.dropna(how='all',axis=0)
    return Data

Upvotes: 4

Mike Müller

Reputation: 85612

You can read from multiple sheets with pandas:

excel_file = pd.ExcelFile('myfile.xls')
df1 = excel_file.parse('Sheet1')
df2 = excel_file.parse('Sheet2')

So, just open one file after the other, read from the sheets you want and process the data frames.

Upvotes: 4

A whole sheet into a pandas dataframe with xlwings

Answers (7)

Related Questions