Conner Russell
Conner Russell

Reputation: 1

Summing values under a specific cell in pandas?

I have a dataframe that looks like this, only is significantly larger. This is also part of a recurring issue I have to run monthly, so the values will always be changing:

| Name      | Category | Sales|
|-----------|----------|------|
| Product 1 | Sports   | 50   |
|   Friends |          | 30   |
|   Family  |          | 20   |
| Product 2 | Sports   | 50   |
|   Friends |          | 30   |
|   Family  |          | 20   |
| Product 3 | Politics | 50   |
|   Friends |          | 30   |
|   Family  |          | 20   |

Basically, I'd like to see the total number of sales, based on category and subgroup (friends/family). I feel I might've gotten closer by replacing the blank cells with the values above it, so the Category column is populated, but I'm not quite sure how to proceed from there.

Any ideas or areas for me to investigate?

Upvotes: 0

Views: 702

Answers (2)

jadore801120
jadore801120

Reputation: 77

your thought is close to the answer. First of all, we need to fill up the blank Category cell for each row, and we can easily do that with keep the previous non-empty cell content.

There are two ways to do that.

1. Loop over all the rows with apply()

You can use the apply() function of dataframe to assign new values to a column.

prev_category = None
def follow_previous_nonempty_row(category):
    global prev_category
    if category:
        prev_category = category
        return category
    else:
        return prev_category
    
df.Category = df.Category.apply(follow_previous_nonempty_row)

2. Fill up only the NA cell with fillna()

With the setting ffill or pad and along axis 0, fillna() will copy the previous row valid value into the NA cell.

df.Category = df.Category.fillna(method='ffill')

And you can find out that all the empty category cells are filled with values. Therefore, you can easily perform any conditioned summation now!

Upvotes: 0

Michael Silverstein
Michael Silverstein

Reputation: 1843

One of the hardest parts of using dataframes is having the data in the right format. The best way to represent this data for working in pandas may be something like this:

| product |   group | category | sales |
| :-------|--------:|----------|-------|
|    1    | friends | sports   | 30    |
|    1    |  family | sports   | 20    |
|    2    | friends | sports   | 30    |
| 2       | family  | sports   | 20    |
| 3       | friends | politics | 30    |
| 3       | family  | politics | 20    |

This places one item per row and eliminates replicating information.

After representing your data in this format, I would recommend using Pandas's groupby function:

df.groupby(['category', 'group']).sales.sum() would give you the total number of sales for each category, group combination.

Upvotes: 1

Related Questions