Reputation: 161
Let's say I have an excel file that has columns like:
Name Day of Shopping Food bought
Bob March 14, 2015 Apple
Bob March 14, 2015 Pear
Joe April 15, 2013 Apple
Bob June 28, 2017 Pomegranite
Jake January 4, 2008 Orange
Jake April 9, 2010 Peach
Joe December 6, 2016 Banana
And another excel file that reads:
Name Day of Shopping Drink bought
Bob March 14, 2015 Water
Joe April 15, 2013 Juice
Bob June 28, 2017 Soda
Jake January 4, 2008 Water
Jake April 9, 2010 Tea
Joe December 6, 2016 Soda
What I want to do is make this into a new excel sheet or just a matrix that reads something like:
Name Apple Pear Pomegranite Orange Peach Banana Water Juice Soda Tea
Bob 1 1 0 0 0 0 1 0 0 0
Joe 1 0 0 0 0 0 0 1 0 0
Jake 0 0 0 1 0 0 1 0 0 0
So that in the end I get a matrix that has the names, and the following columns list out the food options and the values that the names are assigned (1 or 0) are for whether or not the item was bought for that day. Notice that in the matrix that I described above only values for the first date was found even though each person had a different initial date. More columns would describe the other dates.
Please help.
Upvotes: 2
Views: 6640
Reputation: 9264
You can use pd.crosstab
to do this. Ill assume you loaded your data into pandas and named your dataframe's df1
and df2
. First rename both columns Food bought
and Drink bought
to a similar name (I chose Item
) to easily concatenate values
df1.rename(columns={'Food bought':'Item'},inplace=True)
df2.rename(columns={'Drink bought':'Item'},inplace=True)
df = pd.concat([df1,df2])
Then compute the cross tabulation
pd.crosstab(df.Name,df.Item)
Which will output
Item Apple Banana Juice Orange Peach Pear Pomegranite Soda Tea Water
Name
Bob 1 0 0 0 0 1 1 1 0 1
Jake 0 0 0 1 1 0 0 0 1 1
Joe 1 1 1 0 0 0 0 1 0 0
Moving on step further, you can compute the cross tabulation grouped by the date. To do this in pandas we need to convert Day of Shopping
into datetime objects
df['Day of Shopping'] = pd.to_datetime(df['Day of Shopping'])
Now just add this information to pd.crosstab
pd.crosstab([df['Day of Shopping'],df.Name],df.Item)
Item Apple Banana Juice Orange Peach Pear Pomegranite \
Day of Shopping Name
2008-01-04 Jake 0 0 0 1 0 0 0
2010-04-09 Jake 0 0 0 0 1 0 0
2013-04-15 Joe 1 0 1 0 0 0 0
2015-03-14 Bob 1 0 0 0 0 1 0
2016-12-06 Joe 0 1 0 0 0 0 0
2017-06-28 Bob 0 0 0 0 0 0 1
Item Soda Tea Water
Day of Shopping Name
2008-01-04 Jake 0 0 1
2010-04-09 Jake 0 1 0
2013-04-15 Joe 0 0 0
2015-03-14 Bob 0 0 1
2016-12-06 Joe 1 0 0
2017-06-28 Bob 1 0 0
Upvotes: 1
Reputation: 13750
Assuming you have the first spreadsheet in a data frame named df1
and the second in df2
, the correct way to do this would be:
import pandas as pd
df = pd.concat([df1, df2])
pd.get_dummies(df, columns=['Food bought'], 'Drink bought']).groupby('Name').sum()
Explanation: first use pd.concat
to combine the two tables vertically. Then, use pd.get_dummies
to one-hot encode the values in the "bought" columns -- this will add a column for each unique food and drink found, and a 1 when that row contains the value (so only one 1 per row). Then groupby
"Name" to condense down into names, and finally take the sum
for each person to aggregate the total items bought into one row for each person.
To read excel files into python, use pd.read_excel
Upvotes: 0
Reputation: 665
In terms of reading your excel sheets, you can export them as CSV files and use Python's CSV Parser to read your values from the sheet. Personally, I would use a defaultdict from the collections
module. This allows you to set default values for your dictionary keys. You can create a default dict with a default value of 0 for every unique name that you have, and as they buy new products, increase their values by 1. To create the output, you iterate through each customer, and then you iterate through the list of products that they can buy -- since it is a defaultdict with 0 as the default value, you will not have issues of keys not matching.
Rough example code:
from collections import defaultdict
customers = {}
# Name represents the first column of your CSV entries, and product is the third column
for entry in csvdata:
name = entry[0]
if name not in customers:
customers[name] = defaultdict(int) # Sets the default value to zero
product = entry[2]
customers[name][product] += 1
# Now you have a dictionary with customers as sub-dictionaries
# Each sub-dictionary has the count for each product
product_list = ["apples", "oranges", "pears"]
for customer, purchases in customers.items():
for product in product_list:
print(purchases[product])
Upvotes: 0