Reputation: 1
I have a dataframe that looks like this, only is significantly larger. This is also part of a recurring issue I have to run monthly, so the values will always be changing:
| Name | Category | Sales|
|-----------|----------|------|
| Product 1 | Sports | 50 |
| Friends | | 30 |
| Family | | 20 |
| Product 2 | Sports | 50 |
| Friends | | 30 |
| Family | | 20 |
| Product 3 | Politics | 50 |
| Friends | | 30 |
| Family | | 20 |
Basically, I'd like to see the total number of sales, based on category and subgroup (friends/family). I feel I might've gotten closer by replacing the blank cells with the values above it, so the Category column is populated, but I'm not quite sure how to proceed from there.
Any ideas or areas for me to investigate?
Upvotes: 0
Views: 702
Reputation: 77
your thought is close to the answer.
First of all, we need to fill up the blank Category
cell for each row, and we can easily do that with keep the previous non-empty cell content.
There are two ways to do that.
apply()
You can use the apply()
function of dataframe to assign new values to a column.
prev_category = None
def follow_previous_nonempty_row(category):
global prev_category
if category:
prev_category = category
return category
else:
return prev_category
df.Category = df.Category.apply(follow_previous_nonempty_row)
fillna()
With the setting ffill
or pad
and along axis 0, fillna()
will copy the previous row valid value into the NA cell.
df.Category = df.Category.fillna(method='ffill')
And you can find out that all the empty category cells are filled with values. Therefore, you can easily perform any conditioned summation now!
Upvotes: 0
Reputation: 1843
One of the hardest parts of using dataframes is having the data in the right format. The best way to represent this data for working in pandas may be something like this:
| product | group | category | sales |
| :-------|--------:|----------|-------|
| 1 | friends | sports | 30 |
| 1 | family | sports | 20 |
| 2 | friends | sports | 30 |
| 2 | family | sports | 20 |
| 3 | friends | politics | 30 |
| 3 | family | politics | 20 |
This places one item per row and eliminates replicating information.
After representing your data in this format, I would recommend using Pandas's groupby function:
df.groupby(['category', 'group']).sales.sum()
would give you the total number of sales for each category
, group
combination.
Upvotes: 1