Reputation: 1677
Scenario: I put together this little Frankenstein of a code (with some awesome help from SO users) to get data from excel files and put into a pandas dataframe.
What I am trying to do: I am trying to get data from files that may contain one or more worksheets of data. After that I intend to organize the dataframe accordingly. For example:
date1 identifier 1 bid ask
date1 identifier 2 bid ask
date1 identifier 3 bid ask
date2 identifier 1 bid ask
date2 identifier 3 bid ask
date3 identifier 4 bid ask
date3 identifier 5 bid ask
Obs1: Each file can have values for "Bid", "Ask" or both, each in a separate worksheet.
Obs2: The identifiers and dates may or may not be the same across files.
What I did so far: My current code reads the files, and each worksheet. If it follows the condition, it attaches to a specific dataframe. Then it fixes the column headings.
Issue: When my code runs, it yields two empty dataframes for some reason.
Question: How can I account for different worksheets and output the values accordingly (to the structure above) to a dataframe?
Current Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import datetime as dt
from datetime import datetime
import matplotlib as mpl
from openpyxl import load_workbook
directory = os.path.join("C:\\","Users\\DGMS\\Desktop\\final 2")
list_of_dfs = []
dfbid = pd.DataFrame()
dfask = pd.DataFrame()
for root,dirs,files in os.walk(directory):
for file in files:
f = os.path.join(root, file)
wb = load_workbook(f)
for sheet in wb.worksheets:
if sheet == "Bid":
dfbid = pd.concat([dfbid, pd.read_excel(f, "Bid")])
for i in range(1,len(dfbid.columns)):
dfbid.columns.values[i] = pd.to_datetime(dfbid.columns.values[i])
elif sheet == "Ask":
dfask = pd.concat([dfask, pd.read_excel(f, "Ask")])
for i in range(1,len(dfask.columns)):
dfask.columns.values[i] = pd.to_datetime(dfask.columns.values[i])
Upvotes: 0
Views: 6441
Reputation: 7058
Separate the different things your code does in different functions.
datetime
This way you can check and inspect each step separately instead of have it all intertwined
import pandas as pd
from pathlib import Path
root_dir = Path(r"C:\Users\DGMS\Desktop\final 2")
files = root_dir.glob('**/*.xlsx')
Read each file and return the worksheets 'Bid'
and 'Ask'
, then generate 2 lists of Dataframes
def parse_workbook(file):
d = pd.read_excel(file, sheetname=None)
return d.get('Bid', None), d.get('Ask', None)
df_bid_dfs, df_ask_dfs = zip(*(parse_workbook(file) for file in files))
datetime
def parse_datetime(df):
for column_name, column in df.iteritems():
df[column_name] = pd.to_datetime(column)
return df
df_bid = pd.concat(parse_datetime(df) for df in df_bid_dfs if df)
df_ask = pd.concat(parse_datetime(df) for df in df_ask_dfs if df)
parse_datetime
and the concatenationdf1 = pd.DataFrame(['20170718'])
df2 = pd.DataFrame(['20170719'])
df_bid_dfs = (df1, df2)
pd.concat(parse_datetime(df) for df in df_bid_dfs)
0
0 2017-07-18
0 2017-07-19
Upvotes: 2