Pandas Merge of Dataframes

Question

I want to compare a group of data to the rolled-up aggregate of that same data.

In the example below, I want to know how much money each restaurant makes as compared to the total for all restaurants. I want to know this by day. If a restaurant is closed that day, I still want to return the name of the day and the name of the restaurant with a NaN (or a zero) for the total bill column.

(I know that there are other ways to do this, but please consider this a question on merging, as there is a reason it needs to be done with two different dataframes)

So, the output I would want is:

day   total_bill_x   restaurant   total_bill_y
Fri         651.76   DINER A            325.88
Sat       3,556.80   DINER A          1,778.40
Sun       1,627.16   DINER A               NaN
Thur      2,192.66   DINER A          1,096.33

But the closest I can get returns something like the following, because of how outer joins work:

day   total_bill_x   restaurant   total_bill_y
Fri         651.76   DINER A            325.88
Sat       3,556.80   DINER A          1,778.40
Sun       1,627.16   NaN                   NaN
Thur      2,192.66   DINER A          1,096.33

The problem is that there is no merge that I can think of that would support this type of output - DINER A will just disappear if there isn't a match.

It gets worse with my real data set, which will consist of dozens of restaurants.

Is it possible to do an outer join that grabs fields from the table being joined to when there isn't a match? How can I see all of the records from the larger table, including those for which the smaller table has no match?

I think it is an interesting problem, and I am definitely interested in how others would handle it. Thanks!

Sample Code Follows:

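import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv", sep=',')
df2 = pd.read_csv("https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv", sep=',')

# DINER A is closed on Sundays; .copy() avoids a SettingWithCopyWarning below
df = df[df['day'] != "Sun"].copy()
df['restaurant'] = "DINER A"

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df3 = pd.concat([df, df2])

# Totals per restaurant and day, and totals per day across everything
df_output = df.groupby(['restaurant', 'day'])[['total_bill']].sum().reset_index()
df_output2 = df3.groupby(['day'])[['total_bill']].sum().reset_index()

# The outer join leaves restaurant as NaN on days with no match (Sun)
pd.merge(df_output2, df_output, on='day', how="outer")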

piRSquared · Accepted Answer

setup

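import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/wesm/pydata-book/master/ch08/tips.csv", sep=',')

# Stack a Sunday-less copy (DINER A) on top of the full data (DINER B);
# the concat keys become the 'restaurant' column
d1 = pd.concat([df.query('day != "Sun"'), df], keys=['DINER A', 'DINER B']) \
    .rename_axis(['restaurant', None]).reset_index('restaurant')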

build a `pd.MultiIndex` with all restaurants and days

mux = pd.MultiIndex.from_product([
        d1.restaurant.unique(),
        d1.day.unique()
    ], names=['restaurant', 'day'])
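To see why this index matters, here is a minimal sketch on toy data. The numbers and the names `sales`, `full_index`, `per_restaurant` and `completed` are made up for illustration and are not from the tips dataset. Reindexing the grouped totals against the full product of restaurants and days creates a row for every pair, with NaN where a restaurant had no sales that day.

```python
import pandas as pd

# Toy data (hypothetical values): DINER A has no Sunday rows.
sales = pd.DataFrame({
    "restaurant": ["DINER A", "DINER A", "DINER B", "DINER B", "DINER B"],
    "day": ["Fri", "Sat", "Fri", "Sat", "Sun"],
    "total_bill": [10.0, 20.0, 10.0, 20.0, 30.0],
})

# Every (restaurant, day) combination, whether or not it occurs in the data.
full_index = pd.MultiIndex.from_product(
    [sales["restaurant"].unique(), sales["day"].unique()],
    names=["restaurant", "day"],
)

# Grouped totals only contain the pairs that actually occur ...
per_restaurant = sales.groupby(["restaurant", "day"])[["total_bill"]].sum()

# ... and reindexing adds the missing pairs as NaN rows.
completed = per_restaurant.reindex(full_index)
print(completed)
```

If zeros are preferred over NaN, `reindex` also accepts `fill_value=0`.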

do `groupby` + `join` + `reindex`

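# d2: totals per day across all restaurants; d3: totals per restaurant and day
d2 = d1.groupby(['day']).total_bill.sum()
d3 = d1.groupby(['restaurant', 'day'])[['total_bill']].sum()

# Fill in every (restaurant, day) pair, then join the daily totals on 'day'
d3.reindex(mux).join(d2, lsuffix='_x', rsuffix='_y').reset_index()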

  restaurant   day  total_bill_x  total_bill_y
0    DINER A   Sat       1778.40       3556.80
1    DINER A  Thur       1096.33       2192.66
2    DINER A   Fri        325.88        651.76
3    DINER A   Sun           NaN       1627.16
4    DINER B   Sat       1778.40       3556.80
5    DINER B  Thur       1096.33       2192.66
6    DINER B   Fri        325.88        651.76
7    DINER B   Sun       1627.16       1627.16
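Since the question asks specifically about merging, the same result can probably also be reached with two merges. The sketch below is an alternative, not the method used above: it assumes pandas 1.2 or later (for `how="cross"`) and uses made-up toy data, with `sales`, `grid` and `result` as illustrative names. The idea is to first cross-join the daily totals with the list of restaurants, so that every restaurant appears on every day, and then left-merge the per-restaurant totals onto that grid.

```python
import pandas as pd

# Toy data (hypothetical values): DINER A has no Sunday rows.
sales = pd.DataFrame({
    "restaurant": ["DINER A", "DINER A", "DINER B", "DINER B", "DINER B"],
    "day": ["Fri", "Sat", "Fri", "Sat", "Sun"],
    "total_bill": [10.0, 20.0, 10.0, 20.0, 30.0],
})

per_restaurant = (
    sales.groupby(["restaurant", "day"])[["total_bill"]].sum().reset_index()
)
per_day = sales.groupby(["day"])[["total_bill"]].sum().reset_index()

# One row per restaurant, used to expand the daily totals.
restaurants = per_restaurant[["restaurant"]].drop_duplicates()

# Cross join: every day paired with every restaurant.
grid = per_day.merge(restaurants, how="cross")

# Left merge keeps every grid row; unmatched pairs get NaN in total_bill_y.
result = grid.merge(
    per_restaurant, on=["restaurant", "day"], how="left", suffixes=("_x", "_y")
)
print(result)
```

Here `total_bill_x` is the daily total and `total_bill_y` is the restaurant's own total, which matches the column layout asked for in the question. Appending `.fillna({"total_bill_y": 0})` would give zeros instead of NaN for closed days.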
