DGMS89
DGMS89

Reputation: 1677

Reading data from xlsx into Pandas dataframe

Scenario: I put together this little Frankenstein of a code (with some awesome help from SO users) to get data from excel files and put into a pandas dataframe.

What I am trying to do: I am trying to get data from files that may contain one or more worksheets of data. After that I intend to organize the dataframe accordingly. For example:

date1    identifier 1    bid     ask
date1    identifier 2    bid     ask
date1    identifier 3    bid     ask
date2    identifier 1    bid     ask
date2    identifier 3    bid     ask
date3    identifier 4    bid     ask
date3    identifier 5    bid     ask

Obs1: Each file can have values for "Bid", "Ask" or both, each in a separate worksheet.

Obs2: The identifiers and dates may or may not be the same across files.

What I did so far: My current code reads the files, and each worksheet. If it follows the condition, it attaches to a specific dataframe. Then it fixes the column headings.

Issue: When my code runs, it yields two empty dataframes for some reason.

Question: How can I account for different worksheets and output the values accordingly (to the structure above) to a dataframe?

Current Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import datetime as dt
from datetime import datetime
import matplotlib as mpl
from openpyxl import load_workbook


directory = os.path.join("C:\\","Users\\DGMS\\Desktop\\final 2")        
list_of_dfs = []
dfbid = pd.DataFrame()
dfask = pd.DataFrame()

for root,dirs,files in os.walk(directory):

    for file in files:

        f = os.path.join(root, file)

        wb = load_workbook(f)

        for sheet in wb.worksheets:
            if sheet == "Bid":
                dfbid = pd.concat([dfbid, pd.read_excel(f, "Bid")])
                for i in range(1,len(dfbid.columns)):
                    dfbid.columns.values[i] = pd.to_datetime(dfbid.columns.values[i])

            elif sheet == "Ask":
                dfask = pd.concat([dfask, pd.read_excel(f, "Ask")])
                for i in range(1,len(dfask.columns)):
                    dfask.columns.values[i] = pd.to_datetime(dfask.columns.values[i])

Upvotes: 0

Views: 6441

Answers (1)

Maarten Fabré
Maarten Fabré

Reputation: 7058

Separate the different things your code does in different functions.

  • look for the excel-files
  • read the excel-files
  • convert the content to datetime
  • concatenate the DataFrames

This way you can check and inspect each step separately instead of have it all intertwined

Look for the excel-files

import pandas as pd
from pathlib import Path

root_dir = Path(r"C:\Users\DGMS\Desktop\final 2")

files = root_dir.glob('**/*.xlsx')

Read the excel-files

Read each file and return the worksheets 'Bid' and 'Ask', then generate 2 lists of Dataframes

def parse_workbook(file):
    d = pd.read_excel(file, sheetname=None)
    return d.get('Bid', None), d.get('Ask', None)

df_bid_dfs, df_ask_dfs = zip(*(parse_workbook(file) for file in files))

Convert the content to datetime

def parse_datetime(df):
    for column_name, column in df.iteritems():
        df[column_name] = pd.to_datetime(column)
    return df

Concatenate the DataFrames

df_bid = pd.concat(parse_datetime(df) for df in df_bid_dfs if df)
df_ask = pd.concat(parse_datetime(df) for df in df_ask_dfs if df)

testing parse_datetime and the concatenation

df1 = pd.DataFrame(['20170718'])
df2 = pd.DataFrame(['20170719'])
df_bid_dfs = (df1, df2)
pd.concat(parse_datetime(df) for df in df_bid_dfs)
    0
0   2017-07-18
0   2017-07-19

Upvotes: 2

Related Questions