Reputation: 43
I have a dataframe which contains four columns: ["Artist", "Album", "Title", "Point"] where the first three columns identify a song, and the fourth is a score. Each song may appear in a list multiple times, and some songs has no album information, which is a nan value in the corresponding cell. I use the following code to get the sum score for each song:
B = A.groupby(['Artist', 'Album', 'Title'])['Point'].sum()
C = B.sort_values(by=['Point'], ascending=[False]).reset_index(drop=True)
But what I have realized is, that if there is any row with the album information missing (nan), then that row is not appearing in variable B. How can I enforce nan to be a valid distinctive information? If I have the same song twice, once with album and once without, those two should appear as two distinct records after grouping.
Is there a way doing it properly, or do I have to do some pre-treatment of the data by forcing all "nan"s to become empty strings or something?
Upvotes: 1
Views: 575
Reputation: 8033
IIUC, you can do like this.
Input
Artist Album Point
0 AC1 A 1
1 AC2 B 2
2 AC1 NaN 3
3 AC4 A 4
4 AC5 C 5
a = df.groupby(['Artist'])['Album'].ffill()
df.groupby(['Artist', a ])['Point'].sum().reset_index(name='Points')
output
Artist Album Points
0 AC1 A 4
1 AC2 B 2
2 AC4 A 4
3 AC5 C 5
This code is probably what you can use
df['Album'] = df['Album'].astype(str)
df.groupby(['Artist','Album'])['Point'].sum().reset_index(name='Points')
output
Artist Album Points
0 AC1 A 1
1 AC1 nan 3
2 AC2 B 2
3 AC4 A 4
4 AC5 C 5
Upvotes: 0
Reputation: 36584
I would suggest a slight adjustment to your suggestion. Instead of turning then NaN
values into an empty string, you can replace them with Unknown
. I'll give you a reproducible example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Height':[20, 50, 20, 15, 10],
'Happy':['No', 'Yes', 'No', 'No', 'Yes'],
'Sex':['M', np.nan, 'M', 'M', 'F']})
print(df)
Height Happy Sex
0 20 No M
1 50 Yes NaN
2 20 No M
3 15 No M
4 10 Yes F
df.fillna('Unknown').groupby(['Sex', 'Happy'])['Height'].sum()
Out[8]:
Sex Happy
F Yes 10
M No 55
Unknown Yes 50
Name: Height, dtype: int64
In your case, I'd try:
B = A.fillna('Unknown').groupby(['Artist', 'Album', 'Title'])['Point'].sum()
Upvotes: 1
Reputation: 7204
.fillna('N/A') that will fix the searches. but with more data, there's probably a better solution
A['Album'].fillna('N/A', inplace=True)
you sums should work then
Upvotes: 2