wuv1
wuv1

Reputation: 59

Turning multiple tables in CSV/Excel into dictionary or Dataframe in Python

I am in need of help!

I have an Excel file that contains data I am trying to get into a dataframe, but the data is in a table form that is not easy to work with. Example:

enter image description here

I want to ultimately get it into a dataframe of this form:

Meal               Food                              Calories
Breakfast          English Muffins                   120
Breakfast          Peanut Butter Spread              190
Morning Snack      Banana                            90
Morning Snack      Nectarine                         59
...                ...                               ...

And a separate dataframe for the Daily Totals in this form (ignore the 'Date' column for now):

Date          Calories     Fat     Fiber     Carbs     Sodium     Protein     Water
2017-07-01    699          26      16        93        612        32          0
2017-07-02    ...          ...     ...       ...       ...        ...         ...

I am struggling with getting this into a dataframe. Looking at the screenshot of the dataset, it made sense to first store the data into a dictionary, but that leaves me with a bunch of NaN values, due to all the blank cells.

My thoughts on getting the 'Meal' column the way I want it to look is to do a forward fill, but that would mean I would have to use a Series or Dataframe, which I haven't gotten to yet.

This is what I currently have:

df = pd.read_excel('filename.xls', 'Foods')

# create a list to store the dictionaries
food_logs = []

# this is code to reformat the string values in a certain column 
# to get the name of the sheets I need to use in the Excel. This can be ignored
for day in df.values:
    if day[1] != '0':
        foodLogSheetName = 'Food Log ' + day[0].replace('-', '')
        food_logs.append(foodLogSheetName)

# 'foods' is now a list of nested dictionaries (think of everything in the 
# first screenshot as the outer dictionary, and each of the column as the 
# inner dictionary)
foods = [xls.parse(food_log).to_dict() for food_log in food_logs]

This is what 'foods' currently is, if I printed it out with a line between each of the outer dictionaries: enter image description here

I have the option of working with a CSV file instead, but instead of multiple sheets, I would have multiple 'tables' stacked vertically, if that makes sense

I would greatly appreciate any tips anyone can provide, please!

Upvotes: 0

Views: 663

Answers (1)

Thtu
Thtu

Reputation: 2032

I think you're on the right track with ffill-ing data. It sounds like you're might just have trouble with working with missing data. From the sample you posted, it looks like you can read the whole thing into a dataframe, drop all empty rows, ffill on the meal column, and then drop any rows that are partially empty (or on a subset.)

import pandas as pd

df = pd.read_excel(file_path_or_buffer, sheet_name=my_sheet_name, **other_kwargs)
# You should have a dataframe that looks like
# Meal               Food                              Calories
# Breakfast          
#                    English Muffins                   120
#                    Peanut Butter Spread              190
# ...
# Next drop totally NaN/empty rows
df.dropna(how='all', inplace=True)
df['Meal'] = df['Meal'].fillna(method='ffill')
# Now you should have something that looks like
# Meal               Food                              Calories
# Breakfast          
# Breakfast          English Muffins                   120
# Breakfast          Peanut Butter Spread              190
# ...
# Drop empty rows, if you need to allow for some sparse data, use the subset argument
df.dropna(how='any', inplace=True)

Upvotes: 1

Related Questions