looping through workbook with multiple sheets and extracting values in python

Question

I have an excel spreedsheet with multiple workbooks, and each workbook has the following attributes:

data = pd.DataFrame({
'start date': ['1988/01/12', '1988/01/13', '1988/01/14', '1989/01/20','1990/01/01'],
'end date': ['1989/01/12', '1989/01/12', '1989/01/12', '1990/01/20','1991/01/01'],
'value': [11558522, 12323552, NaN, 18412280, NaN]
})

I want to loop through each sheet in the workbook and:

get the sheetname (this is also the id of the sheet0
get the min date (to see when that sheet started)
get the max date (to see when that sheet ended)
count the number of missing values in that sheet (basesd on the value column only)

I want a summary spreedsheet/dataframe where iIhave the sheetname(id), start date, end date and the number of missing values like this:

data = pd.DataFrame({
sheet name': ['sheet 1', 'sheet 2', 'sheet 3', 'sheet 4','sheet5'],
'start date': ['1988/01/12', '1988/01/13', '1988/01/14', '1989/01/20','1990/01/01'],
'end date': ['1989/01/12', '1989/01/12', '1989/01/12', '1990/01/20','1991/01/01'],
'missing values': [3, 10, 1, 43, 8]
})

jezrael · Accepted Answer

Use sheet_name in read_excel for all sheets to orderdict of DataFrames, convert columns to datetimes and get minimal and maximal and count number of misisng values by sum of mask created by Series.isna, last append to list and last create new DataFrame by constructor:

L = []
for s, df in pd.read_excel('test.xlsx', sheet_name=None).items():
    d = {'sheet name': s, 
         'start date': pd.to_datetime(df['start date']).min(),
         'end date': pd.to_datetime(df['end date']).max(),
         'missing values':df['value'].isna().sum()}
    L.append(d)

df1 = pd.DataFrame(L)

looping through workbook with multiple sheets and extracting values in python

Answers (1)

Related Questions