Tamarie
Tamarie

Reputation: 135

looping through workbook with multiple sheets and extracting values in python

I have an excel spreedsheet with multiple workbooks, and each workbook has the following attributes:

data = pd.DataFrame({
'start date': ['1988/01/12', '1988/01/13', '1988/01/14', '1989/01/20','1990/01/01'],
'end date': ['1989/01/12', '1989/01/12', '1989/01/12', '1990/01/20','1991/01/01'],
'value': [11558522, 12323552, NaN, 18412280, NaN]
}) 

I want to loop through each sheet in the workbook and:

  1. get the sheetname (this is also the id of the sheet0
  2. get the min date (to see when that sheet started)
  3. get the max date (to see when that sheet ended)
  4. count the number of missing values in that sheet (basesd on the value column only)

I want a summary spreedsheet/dataframe where iIhave the sheetname(id), start date, end date and the number of missing values like this:

data = pd.DataFrame({
sheet name': ['sheet 1', 'sheet 2', 'sheet 3', 'sheet 4','sheet5'],
'start date': ['1988/01/12', '1988/01/13', '1988/01/14', '1989/01/20','1990/01/01'],
'end date': ['1989/01/12', '1989/01/12', '1989/01/12', '1990/01/20','1991/01/01'],
'missing values': [3, 10, 1, 43, 8]
}) 

Upvotes: 2

Views: 700

Answers (1)

jezrael
jezrael

Reputation: 862601

Use sheet_name in read_excel for all sheets to orderdict of DataFrames, convert columns to datetimes and get minimal and maximal and count number of misisng values by sum of mask created by Series.isna, last append to list and last create new DataFrame by constructor:

L = []
for s, df in pd.read_excel('test.xlsx', sheet_name=None).items():
    d = {'sheet name': s, 
         'start date': pd.to_datetime(df['start date']).min(),
         'end date': pd.to_datetime(df['end date']).max(),
         'missing values':df['value'].isna().sum()}
    L.append(d)

df1 = pd.DataFrame(L)

Upvotes: 1

Related Questions