Use Python to Make a Matrix (Feature Matrix?)

Let's say I have an excel file that has columns like:

Name      Day of Shopping     Food bought      
 Bob       March 14, 2015           Apple            
 Bob       March 14, 2015            Pear                
 Joe       April 15, 2013           Apple             
 Bob        June 28, 2017     Pomegranite   
Jake      January 4, 2008          Orange   
Jake        April 9, 2010           Peach
 Joe     December 6, 2016          Banana

And another excel file that reads:

Name      Day of Shopping     Drink bought       
 Bob       March 14, 2015            Water    
 Joe       April 15, 2013            Juice    
 Bob        June 28, 2017             Soda
Jake      January 4, 2008            Water
Jake        April 9, 2010              Tea
 Joe     December 6, 2016             Soda

What I want to do is make this into a new excel sheet or just a matrix that reads something like:

Name   Apple  Pear  Pomegranite Orange Peach Banana Water Juice Soda Tea
Bob     1      1     0           0       0     0     1      0    0    0 
Joe     1      0     0           0       0     0     0      1    0    0 
Jake    0      0     0           1       0     0     1      0    0    0

So that in the end I get a matrix that has the names, and the following columns list out the food options and the values that the names are assigned (1 or 0) are for whether or not the item was bought for that day. Notice that in the matrix that I described above only values for the first date was found even though each person had a different initial date. More columns would describe the other dates.

Please help.

Upvotes: 2

Answers (3)

DJK

Reputation: 9264

You can use pd.crosstab to do this. Ill assume you loaded your data into pandas and named your dataframe's df1 and df2. First rename both columns Food bought and Drink bought to a similar name (I chose Item) to easily concatenate values

df1.rename(columns={'Food bought':'Item'},inplace=True)
df2.rename(columns={'Drink bought':'Item'},inplace=True)
df = pd.concat([df1,df2])

Then compute the cross tabulation

pd.crosstab(df.Name,df.Item)

Which will output

Item  Apple  Banana  Juice  Orange  Peach  Pear  Pomegranite  Soda  Tea  Water
Name                                                                          
Bob       1       0      0       0      0     1            1     1    0      1
Jake      0       0      0       1      1     0            0     0    1      1
Joe       1       1      1       0      0     0            0     1    0      0

Moving on step further, you can compute the cross tabulation grouped by the date. To do this in pandas we need to convert Day of Shopping into datetime objects

df['Day of Shopping'] = pd.to_datetime(df['Day of Shopping'])

Now just add this information to pd.crosstab

pd.crosstab([df['Day of Shopping'],df.Name],df.Item)

Item                  Apple  Banana  Juice  Orange  Peach  Pear  Pomegranite  \
Day of Shopping Name                                                           
2008-01-04      Jake      0       0      0       1      0     0            0   
2010-04-09      Jake      0       0      0       0      1     0            0   
2013-04-15      Joe       1       0      1       0      0     0            0   
2015-03-14      Bob       1       0      0       0      0     1            0   
2016-12-06      Joe       0       1      0       0      0     0            0   
2017-06-28      Bob       0       0      0       0      0     0            1   
Item                  Soda  Tea  Water  
Day of Shopping Name                    
2008-01-04      Jake     0    0      1  
2010-04-09      Jake     0    1      0  
2013-04-15      Joe      0    0      0  
2015-03-14      Bob      0    0      1  
2016-12-06      Joe      1    0      0  
2017-06-28      Bob      1    0      0

Upvotes: 1

BallpointBen

Reputation: 13750

Assuming you have the first spreadsheet in a data frame named df1 and the second in df2, the correct way to do this would be:

import pandas as pd
df = pd.concat([df1, df2])
pd.get_dummies(df, columns=['Food bought'], 'Drink bought']).groupby('Name').sum()

Explanation: first use pd.concat to combine the two tables vertically. Then, use pd.get_dummies to one-hot encode the values in the "bought" columns -- this will add a column for each unique food and drink found, and a 1 when that row contains the value (so only one 1 per row). Then groupby "Name" to condense down into names, and finally take the sum for each person to aggregate the total items bought into one row for each person.

To read excel files into python, use pd.read_excel

Upvotes: 0

adapap

Reputation: 665

In terms of reading your excel sheets, you can export them as CSV files and use Python's CSV Parser to read your values from the sheet. Personally, I would use a defaultdict from the collections module. This allows you to set default values for your dictionary keys. You can create a default dict with a default value of 0 for every unique name that you have, and as they buy new products, increase their values by 1. To create the output, you iterate through each customer, and then you iterate through the list of products that they can buy -- since it is a defaultdict with 0 as the default value, you will not have issues of keys not matching.

Rough example code:

from collections import defaultdict

customers = {}

# Name represents the first column of your CSV entries, and product is the third column
for entry in csvdata:
    name = entry[0]
    if name not in customers:
        customers[name] = defaultdict(int) # Sets the default value to zero
    product = entry[2]
    customers[name][product] += 1

# Now you have a dictionary with customers as sub-dictionaries
# Each sub-dictionary has the count for each product

product_list = ["apples", "oranges", "pears"]

for customer, purchases in customers.items():
    for product in product_list:
        print(purchases[product])

Upvotes: 0

Use Python to Make a Matrix (Feature Matrix?)

Answers (3)

Related Questions