AlbinoRhino

Reputation: 497

pandas.read_excel and to_excel to use only the largest sheet or sheet with most columns

Is there a way I can read in all the sheets of an xlsx file, then drop all but the one with the most rows?

Ideally I would like to find the sheet that literally contains the most DATA, as in size in bytes, but rows will work for now.

I would also like to be able to do this by loading only, for example, the head of the file, so that load times are shorter, while still being able to use .columns and len to find which sheet has the most columns.
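For reference, something like this openpyxl sketch (untested, placeholder path) is the kind of cheap pre-scan I'm imagining; read_only=True streams the file instead of parsing every cell, and the row/column counts come from each sheet's declared dimensions:

from openpyxl import load_workbook

# read_only=True streams the worksheet XML instead of building cell objects
wb = load_workbook('excel_file_path.xlsx', read_only=True)
# max_row/max_column come from each sheet's declared dimensions and can be
# None if the file never recorded them, hence the "or 0" guards
largest = max(wb.worksheets,
              key=lambda ws: (ws.max_row or 0) * (ws.max_column or 0))
print(largest.title)
wb.close()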

Upvotes: 0

Views: 1049

Answers (3)

Mayank Porwal

Reputation: 34086

Read all sheets of an Excel file and store it in a dictionary:

import pandas as pd

xls = pd.ExcelFile('excel_file_path.xls')
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
    # parse() reads one sheet into a DataFrame
    sheet_to_df_map[sheet_name] = xls.parse(sheet_name)

Now you can loop over your dictionary and build another dictionary with the row count of each dataframe, like this:

row_count_dict = {}

for key, val in sheet_to_df_map.items():
    row_count_dict[key] = val.shape[0]  # shape[0] is the number of rows

Then find the key of row_count_dict with the maximum value:

sheet_with_max_rows = max(row_count_dict, key=row_count_dict.get)  # sheet name, not the dataframe itself

Then look up the original dict with this key to get the dataframe with the most rows:

df = sheet_to_df_map.get(sheet_with_max_rows)

This will be your final dataframe.
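As a side note, once the sheets are in sheet_to_df_map, the intermediate row-count dictionary can be skipped entirely; len() of a dataframe is its row count, so a single max() call does the same lookup:

# len(df) is the row count, so this picks the dataframe with the most rows
df = max(sheet_to_df_map.values(), key=len)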

Upvotes: 1

agastya

Reputation: 376

This one is very similar to Mayank's answer, only it should be slightly cheaper since it does not store every parsed dataframe in a dictionary; only the current largest sheet is kept in memory.

import pandas as pd

wb = pd.ExcelFile(r'path.xlsx')
max_row_count = 0
df = None  # stays None if every sheet is empty
for sheet in wb.sheet_names:
    parsed_sheet = wb.parse(sheet)
    row_count = len(parsed_sheet)
    if row_count > max_row_count:
        max_row_count = row_count
        df = parsed_sheet

Upvotes: 0

AlbinoRhino

Reputation: 497

Unfortunately, reading in fewer rows, with nrows=1 for example, does not seem to reduce the initial load time. The following code finds the sheet with the largest number of rows and saves it. Note that if more than one sheet is tied for the largest number of rows, only the last one is saved.

import pandas as pd

# sheet_name=None reads every sheet into an {sheet_name: DataFrame} dict
read_in = pd.read_excel(items, sheet_name=None)

max_len = max(len(df) for df in read_in.values())
for name, df in read_in.items():
    if len(df) == max_len and len(df) > 0:
        # strings_to_urls=False lets you save long links that Excel would
        # otherwise try to turn into hyperlinks; for most code you can skip
        # the writer and pass the file name straight to to_excel
        writer = pd.ExcelWriter(filename, engine='xlsxwriter',
                                options={'strings_to_urls': False})
        df.to_excel(writer, index=False)
        writer.close()

What's happening here is that reading in with sheet_name=None reads all the sheets and creates a dictionary where the keys are the sheet names and the values are dataframes of the whole sheets. You then access each dataframe, checking its number of rows, and the code uses that info to save only the dataframe that can be accessed with the key that ends up with the most rows.
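If you wanted the "most data, as in size in bytes" version from the question instead of row counts, one option (a sketch, untested) is to compare each dataframe's in-memory footprint; memory_usage(deep=True) also counts the bytes inside string columns:

# pick the sheet whose dataframe takes the most memory, as a proxy for
# "most data in bytes"; deep=True measures object (string) contents too
biggest_key = max(read_in, key=lambda k: read_in[k].memory_usage(deep=True).sum())
biggest_df = read_in[biggest_key]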

Upvotes: 0
