Reputation: 79
I have a list with input files stored as dictionaries that represent imported Excel files. Each dictionary has multiple sheets stored as DataFrames.
I want to store the same key (sheet) of different dictionaries (Excel files) appended to each other while breaking the sheet when three NaN values appear in a column. I initialise a new dictionary, that should have the same keys (sheets) as the dictionaries (Excel files), but now all data is appended to this single dictionary d_sheets
To do this, I use the code below:
input_files = []
for file in read_input:
input_file = pd.read_excel(io=file, sheet_name=needed_sheets, dtype=str)
input_files.append(input_file)
d_sheets = {}
for dictionary in input_files:
for sheet_name in sorted(dictionary):
d_sheets[sheet_name] = pd.DataFrame()
if sheet_name != 'Sheetname1':
cell = dictionary[sheet_name]['Columnname1']
else:
cell = dictionary[sheet_name]['Columnname2']
three_NaNs = cell.isna() & cell.shift(-1).isna() & cell.shift(-2).isna()
first_instance = cell[three_NaNs].index.min()
good_data = dictionary[sheet_name][cell.index <= first_instance]
d_sheets[sheet_name].append(good_data)
d_sheets[sheet_name] = pd.concat([d_sheets[sheet_name], good_data], axis=0)
It appears that for a single dictionary (Excel file) the boolean statements do their work and the keys are stored in d_sheets
. However, I cant find a working way to loop over the dictionaries in input_files
.
What might cause the problem that I cannot iterate over the items of the list input_files
and store all sheets in d_sheets
?
Upvotes: 2
Views: 250
Reputation: 79
I have solved the issue as the DataFrame initialization should not be inside the for-loop in d_sheets[sheet_name] = pd.DataFrame()
. This way, each iteration the loop clears the DataFrame and then performs the calculations. In the end, this results in the latest Excel file only as output, instead of all the files merged.
Upvotes: 0
Reputation: 958
If I understand correctly your "breaking sheet" strategy (pardon me if I did not understand) here is a script that might do the job:
import pandas as pd
# helper function to break sheets with 3 consecutive NaN in any column
def df_breaker(df):
res = pd.DataFrame()
nan_ind = pd.DataFrame({k:[0] for k in df.columns})
for row in df.iloc:
nan_ind=(row.isna()*nan_ind)+row.isna()
if (nan_ind.iloc[0]>=3).any():
return res.iloc[:-2]
res=res.append(row)
return res
# shall be broken after second row
dfa = pd.DataFrame({'c1':[0,1,None,None,None],'c2':[5,6,7,8,9]})
# shall not be broken
dfb = pd.DataFrame({'c1':[10,None,12,None,14],'c2':[None,16,None,18,19]})
# shall not be broken
dfc = pd.DataFrame({'c1':[20,21,22,23,24],'c2':[25,26,27,28,29]})
# shall not be broken
dfd = pd.DataFrame({'c1':[30,31,32,33,34],'c2':[35,36,37,38,39]})
input_files = [{'sheet1':dfa, 'sheet2':dfb},{'sheet1':dfc, 'sheet2':dfd}]
d_sheets ={}
for key in input_files[0]:
d_sheets[key]=pd.concat([df_breaker(k[key]) for k in input_files])
Upvotes: 1