Nice Micro
Nice Micro

Reputation: 43

How to preserve rows with empty (nan) cells when doing groupby on a dataframe in python pandas

I have a dataframe which contains four columns: ["Artist", "Album", "Title", "Point"] where the first three columns identify a song, and the fourth is a score. Each song may appear in a list multiple times, and some songs has no album information, which is a nan value in the corresponding cell. I use the following code to get the sum score for each song:

B = A.groupby(['Artist', 'Album', 'Title'])['Point'].sum()
C = B.sort_values(by=['Point'], ascending=[False]).reset_index(drop=True)

But what I have realized is, that if there is any row with the album information missing (nan), then that row is not appearing in variable B. How can I enforce nan to be a valid distinctive information? If I have the same song twice, once with album and once without, those two should appear as two distinct records after grouping.

Is there a way doing it properly, or do I have to do some pre-treatment of the data by forcing all "nan"s to become empty strings or something?

Upvotes: 1

Views: 575

Answers (3)

moys
moys

Reputation: 8033

IIUC, you can do like this.

Input

   Artist   Album   Point
0   AC1     A       1
1   AC2     B       2
2   AC1     NaN     3
3   AC4     A       4
4   AC5     C       5
a = df.groupby(['Artist'])['Album'].ffill()
df.groupby(['Artist', a ])['Point'].sum().reset_index(name='Points')

output

 Artist     Album   Points
0   AC1     A       4
1   AC2     B       2
2   AC4     A       4
3   AC5     C       5

This code is probably what you can use

df['Album'] = df['Album'].astype(str)
df.groupby(['Artist','Album'])['Point'].sum().reset_index(name='Points')

output

  Artist    Album   Points
0   AC1     A       1
1   AC1     nan     3
2   AC2     B       2
3   AC4     A       4
4   AC5     C       5

Upvotes: 0

Nicolas Gervais
Nicolas Gervais

Reputation: 36584

I would suggest a slight adjustment to your suggestion. Instead of turning then NaN values into an empty string, you can replace them with Unknown. I'll give you a reproducible example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Height':[20, 50, 20, 15, 10],
                   'Happy':['No', 'Yes', 'No', 'No', 'Yes'],
                   'Sex':['M', np.nan, 'M', 'M', 'F']})
print(df)
   Height Happy  Sex
0      20    No    M
1      50   Yes  NaN
2      20    No    M
3      15    No    M
4      10   Yes    F
df.fillna('Unknown').groupby(['Sex', 'Happy'])['Height'].sum()
Out[8]: 
Sex      Happy
F        Yes      10
M        No       55
Unknown  Yes      50
Name: Height, dtype: int64

In your case, I'd try:

B = A.fillna('Unknown').groupby(['Artist', 'Album', 'Title'])['Point'].sum()

Upvotes: 1

oppressionslayer
oppressionslayer

Reputation: 7204

.fillna('N/A') that will fix the searches. but with more data, there's probably a better solution

A['Album'].fillna('N/A', inplace=True)

you sums should work then

Upvotes: 2

Related Questions