Reputation: 497
Is there a way I can read in all the sheets of an xlsx file, then drop all but the one with the most rows?
Ideally I would like to find the sheet that literally contains the most data, as in size in bytes, but rows will work for now.
I would also like to be able to do this by loading, for example, just the head of each sheet, so that load times are lower but I can still use .columns and len to find which sheet has the most columns.
Upvotes: 0
Views: 1049
Reputation: 34086
Read all sheets of the Excel file and store them in a dictionary:
import pandas as pd

xls = pd.ExcelFile('excel_file_path.xls')
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
    sheet_to_df_map[sheet_name] = xls.parse(sheet_name)
Now, you can loop over this dictionary and build another dictionary holding each dataframe's row count:
row_count_dict = {}
for key, val in sheet_to_df_map.items():
    row_count_dict[key] = val.shape[0]
Then find the key of row_count_dict with the maximum value:
df_with_max_rows = max(row_count_dict, key=row_count_dict.get)
Then look up the original dict with this key to get the dataframe with the most rows:
df = sheet_to_df_map.get(df_with_max_rows)
This will be your final dataframe.
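If you only need the largest sheet, the two dictionaries can also be collapsed into a single max() call, since len() of a DataFrame is its row count. A minimal sketch (the sheet names and data here are made up; in real code sheet_to_df_map is the dict built above):

```python
import pandas as pd

# Hypothetical stand-in for the sheet_to_df_map built above
sheet_to_df_map = {
    "small": pd.DataFrame({"a": [1]}),
    "big": pd.DataFrame({"a": [1, 2, 3]}),
}

# len() of a DataFrame is its row count, so max() picks the largest sheet
df_with_max_rows = max(sheet_to_df_map, key=lambda name: len(sheet_to_df_map[name]))
df = sheet_to_df_map[df_with_max_rows]
```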
Upvotes: 1
Reputation: 376
This one is very similar to Mayank's answer, only it should be slightly faster and lighter on memory, since it does not create a dictionary and store every parsed dataframe in it.
import pandas as pd

wb = pd.ExcelFile(r'path.xlsx')
max_row_count = 0
df = None  # stays None if every sheet is empty
for sheet in wb.sheet_names:
    parsed_sheet = wb.parse(sheet)
    row_count = len(parsed_sheet)
    if row_count > max_row_count:
        max_row_count = row_count
        df = parsed_sheet
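The selection logic can be sanity-checked without a workbook on hand. A minimal sketch, where the in-memory frames are made-up stand-ins for the wb.parse(sheet) results:

```python
import pandas as pd

# Hypothetical stand-ins for wb.parse(sheet) results
frames = {
    "first": pd.DataFrame({"x": range(2)}),
    "second": pd.DataFrame({"x": range(5)}),
}

df = None           # stays None if every sheet turns out to be empty
max_row_count = 0
for name, parsed_sheet in frames.items():
    row_count = len(parsed_sheet)
    if row_count > max_row_count:
        max_row_count = row_count
        df = parsed_sheet
```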
Upvotes: 0
Reputation: 497
Unfortunately, reading in fewer rows (with nrows=1, for example) does not seem to reduce the initial load time. The following code finds the sheet with the largest number of rows and saves it. Note that if more than one sheet is tied for the largest number of rows, only the last one is saved.
import pandas as pd

# `items` is the path of the input workbook, `filename` the output path
read_in = pd.read_excel(items, sheet_name=None)
keys = []
for ijk in read_in.keys():
    keys.append(ijk)
list_lens = []
for key in keys:
    list_lens.append(len(read_in[key]))
counter = 0
for lengths in list_lens:
    if lengths == max(list_lens):
        if len(read_in[keys[counter]]) > 0:
            # strings_to_urls=False ensures you can save longer links that
            # Excel tries to make hyperlinks; you can skip the writer and
            # give the file name to to_excel instead for most code
            writer = pd.ExcelWriter(filename, engine='xlsxwriter',
                                    options={'strings_to_urls': False})
            read_in[keys[counter]].to_excel(writer, index=False)
            writer.close()  # unnecessary if you gave the file name directly
    counter = counter + 1
What's happening here is that reading in with sheet_name=None reads all the sheets and creates a dictionary where the keys are the sheet names and the values are dataframes of the whole sheets. You then access each dataframe, checking its number of rows, and the code uses that info to save only the dataframe that can be accessed using the key that ends up with the most rows.
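The same flow can be condensed into one helper once the sheets are in a dict. A hedged sketch (largest_sheet is a made-up name; note that max() keeps the first tied sheet, unlike the loop above, which keeps the last):

```python
import pandas as pd

def largest_sheet(sheets):
    """Return the DataFrame with the most rows from a {name: DataFrame} dict."""
    # Ties keep the first sheet encountered, following the dict's order
    return max(sheets.values(), key=len)

# Usage, assuming `items` is the input path and `filename` the output path:
# all_sheets = pd.read_excel(items, sheet_name=None)
# largest_sheet(all_sheets).to_excel(filename, index=False)
```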
Upvotes: 0