AJG519
AJG519

Reputation: 3379

Pandas groupby with missing key

Say I have the following dataframe of addresses and phone numbers:

>>> import pandas as pd
>>> df=pd.DataFrame()
>>> df['Address']=['1 Main St','1 Main St','45 Spruce St','45 Spruce St','100 Green St','100 Green St', '500 Washington','500 Washington']
>>> df['Phone']=['555-5555','555-5555','666-6666','666-6667','777-7777',None,None,None]
>>> df
          Address     Phone
0       1 Main St  555-5555
1       1 Main St  555-5555
2    45 Spruce St  666-6666
3    45 Spruce St  666-6667
4    100 Green St  777-7777
5    100 Green St      None
6  500 Washington      None
7  500 Washington      None
>>> 

I want to groupby both the Address and Phone fields and get a count of addresses in each group. However, I have an added criteria: groups with the same address, but missing (not different) phone numbers should be considered as having the same address and phone numbers. In my example below, this means that the last record should be assumed to have phone number 777-7777 as well. Additionally, groups that have the same address, but neither have phone numbers should be grouped as well (Washington in my example).

When I groupby Address and Phone, None values are dropped:

>>> df.groupby(['Address','Phone']).size().reset_index().rename(columns={0:'Size'})
        Address     Phone  Size
0     1 Main St  555-5555     2
1  100 Green St  777-7777     1
2  45 Spruce St  666-6666     1
3  45 Spruce St  666-6667     1
>>> 

Essentially, I am looking to group by Address and Phone when all records within an address group have phone numbers, and just by address when phone values are missing. My desired output would be:

          Address     Phone
0       1 Main St  555-5555
1    45 Spruce St  666-6666
2    45 Spruce St  666-6667
3    100 Green St  777-7777
4  500 Washington      None
>>> 

Any suggestions as to how I could do this?

Upvotes: 2

Views: 289

Answers (1)

Ami Tavory
Ami Tavory

Reputation: 76346

Nice question.

How about the following:

def count_phones(g):
    distinct = len(g.dropna().unique())
    return distinct if distinct else 1

>>> df.Phone.groupby(df.Address).agg(count_phones)
Address
1 Main St         1
100 Green St      1
45 Spruce St      2
500 Washington    1
dtype: int64
  • Grouping is just by address.

  • Within each group, the phone numbers' non-null unique values are counted.

  • If the number is 0, then all were None, and we return 1.

Upvotes: 1

Related Questions