Wizard
Wizard

Reputation: 79

Looping over a list of dictionaries with DataFrames

I have a list with input files stored as dictionaries that represent imported Excel files. Each dictionary has multiple sheets stored as DataFrames.

I want to store the same key (sheet) of different dictionaries (Excel files) appended to each other while breaking the sheet when three NaN values appear in a column. I initialise a new dictionary, that should have the same keys (sheets) as the dictionaries (Excel files), but now all data is appended to this single dictionary d_sheets

To do this, I use the code below:

input_files = []
for file in read_input:
    input_file = pd.read_excel(io=file, sheet_name=needed_sheets, dtype=str)
    input_files.append(input_file)

d_sheets = {}
for dictionary in input_files:
      for sheet_name in sorted(dictionary):
            d_sheets[sheet_name] = pd.DataFrame()
            if sheet_name != 'Sheetname1':
                cell = dictionary[sheet_name]['Columnname1']
            else:
                cell = dictionary[sheet_name]['Columnname2']
            three_NaNs = cell.isna() & cell.shift(-1).isna() & cell.shift(-2).isna()
            first_instance = cell[three_NaNs].index.min()
            good_data = dictionary[sheet_name][cell.index <= first_instance]
            d_sheets[sheet_name].append(good_data)
            d_sheets[sheet_name] = pd.concat([d_sheets[sheet_name], good_data], axis=0)

It appears that for a single dictionary (Excel file) the boolean statements do their work and the keys are stored in d_sheets. However, I cant find a working way to loop over the dictionaries in input_files.

What might cause the problem that I cannot iterate over the items of the list input_files and store all sheets in d_sheets?

Upvotes: 2

Views: 250

Answers (2)

Wizard
Wizard

Reputation: 79

I have solved the issue as the DataFrame initialization should not be inside the for-loop in d_sheets[sheet_name] = pd.DataFrame(). This way, each iteration the loop clears the DataFrame and then performs the calculations. In the end, this results in the latest Excel file only as output, instead of all the files merged.

Upvotes: 0

Gillu13
Gillu13

Reputation: 958

If I understand correctly your "breaking sheet" strategy (pardon me if I did not understand) here is a script that might do the job:

import pandas as pd

# helper function to break sheets with 3 consecutive NaN in any column
def df_breaker(df):
    res = pd.DataFrame()
    nan_ind = pd.DataFrame({k:[0] for k in df.columns})
    for row in df.iloc:
        nan_ind=(row.isna()*nan_ind)+row.isna()
        if (nan_ind.iloc[0]>=3).any():
            return res.iloc[:-2]
        res=res.append(row)
    return res

# shall be broken after second row
dfa = pd.DataFrame({'c1':[0,1,None,None,None],'c2':[5,6,7,8,9]})
# shall not be broken 
dfb = pd.DataFrame({'c1':[10,None,12,None,14],'c2':[None,16,None,18,19]})
# shall not be broken 
dfc = pd.DataFrame({'c1':[20,21,22,23,24],'c2':[25,26,27,28,29]})
# shall not be broken 
dfd = pd.DataFrame({'c1':[30,31,32,33,34],'c2':[35,36,37,38,39]})

input_files = [{'sheet1':dfa, 'sheet2':dfb},{'sheet1':dfc, 'sheet2':dfd}]


d_sheets ={}
for key in input_files[0]:
    d_sheets[key]=pd.concat([df_breaker(k[key]) for k in input_files])

Upvotes: 1

Related Questions