Reputation: 59
I am in need of help!
I have an Excel file that contains data I am trying to get into a dataframe, but the data is in a table form that is not easy to work with. Example:
I want to ultimately get it into a dataframe of this form:
Meal Food Calories
Breakfast English Muffins 120
Breakfast Peanut Butter Spread 190
Morning Snack Banana 90
Morning Snack Nectarine 59
... ... ...
And a separate dataframe for the Daily Totals in this form (ignore the 'Date' column for now):
Date Calories Fat Fiber Carbs Sodium Protein Water
2017-07-01 699 26 16 93 612 32 0
2017-07-02 ... ... ... ... ... ... ...
I am struggling with getting this into a dataframe. Looking at the screenshot of the dataset, it made sense to first store the data into a dictionary, but that leaves me with a bunch of NaN values, due to all the blank cells.
My thoughts on getting the 'Meal' column the way I want it to look is to do a forward fill, but that would mean I would have to use a Series or Dataframe, which I haven't gotten to yet.
This is what I currently have:
df = pd.read_excel('filename.xls', 'Foods')
# create a list to store the dictionaries
food_logs = []
# this is code to reformat the string values in a certain column
# to get the name of the sheets I need to use in the Excel. This can be ignored
for day in df.values:
if day[1] != '0':
foodLogSheetName = 'Food Log ' + day[0].replace('-', '')
food_logs.append(foodLogSheetName)
# 'foods' is now a list of nested dictionaries (think of everything in the
# first screenshot as the outer dictionary, and each of the column as the
# inner dictionary)
foods = [xls.parse(food_log).to_dict() for food_log in food_logs]
This is what 'foods' currently is, if I printed it out with a line between each of the outer dictionaries:
I have the option of working with a CSV file instead, but instead of multiple sheets, I would have multiple 'tables' stacked vertically, if that makes sense
I would greatly appreciate any tips anyone can provide, please!
Upvotes: 0
Views: 663
Reputation: 2032
I think you're on the right track with ffill-ing data. It sounds like you're might just have trouble with working with missing data. From the sample you posted, it looks like you can read the whole thing into a dataframe, drop all empty rows, ffill on the meal column, and then drop any rows that are partially empty (or on a subset.)
import pandas as pd
df = pd.read_excel(file_path_or_buffer, sheet_name=my_sheet_name, **other_kwargs)
# You should have a dataframe that looks like
# Meal Food Calories
# Breakfast
# English Muffins 120
# Peanut Butter Spread 190
# ...
# Next drop totally NaN/empty rows
df.dropna(how='all', inplace=True)
df['Meal'] = df['Meal'].fillna(method='ffill')
# Now you should have something that looks like
# Meal Food Calories
# Breakfast
# Breakfast English Muffins 120
# Breakfast Peanut Butter Spread 190
# ...
# Drop empty rows, if you need to allow for some sparse data, use the subset argument
df.dropna(how='any', inplace=True)
Upvotes: 1