How to include zero-values with Pandas count and merge results with original dataframe

My dataset looks like this:

   Year     Month   Day         Category          Quantity
   1984     1        1          2                   10.5
   1984     1        1          6                   3.7
   1985     1        2          8                   4.8
   1985     2        1          3                   20
   1986     1        1          1                   9
   1986     2        1          18                  12.6
   1987     1        29         20                  2.8

Note that every day of every month of every year contains one unique entry. In other words each day can only have one category (not several).

I am trying to count how many times each category occurred every year.

However, using count in Pandas I realized zero-counts are not included. In other words, if a category did not take place one year it is not included. So in order to solve that I tried using: fill_value=0 (as shown in the below code).

I ended up with this (WARNING: don't run this code as is because it eats up all the memory apparently):

 import pandas as pd

    df = pd.read_csv("import.csv", header=0,
                        encoding='iso-8859-1')


    midx = pd.MultiIndex.from_product([
            df['Year'],
            df['Category']
            ], names=['Year', 'Category'])


    df['QuantityWithNaN'] = pd.to_numeric(df['Quantity'], errors='coerce')

    count_quantity_yearly_above_5 = df[df['QuantityWithNaN'] > 5.0].groupby(['Year', 'Category'])['Quantity'].count()

    count_quantity_yearly_above_5.reindex(midx, fill_value=0)

    df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1) 

df.to_csv("export.csv",encoding='iso-8859-1')

The datatypes of the dataframe df is after running that code:

The datatypes for the dataframe that is imported from the CSV is as:

Year                int64
Month               int64
Day                 int64
Category            int64
Quantity            object
QuantityWithNaN     float64

The end result should thus be like, but is not achieved with the code above. (The end result must not be sorted in any specific order, only important thing is that all categories appear each year):

Year        Month   Day     Category   Quantity count_quantity_yearly_above_5

   1984     1       1           1           10.5                        2
   1984     1       1           2           3.7                         7
   1984     1       2           3           4.8                         1
   1985     2       1           1           20                          9
   1985     1       1           2           9                           1
   1986     2       1           3           12.6                        4
   1987     1       29          20          2.8                         5
   1988                         10           2                          0

Also for visualization and the information that matters in the end will entirely be given by the columns as follows, so zero-counts are included and there are no duplicate rows for each combination of year and category (obviously I am to lazy, and it would take a lot more space to include all categories (1-20) every year here):

Year                    Category                count_quantity_yearly_above_5

   1984                     1                                               2
   1984                     2                                               7
   1984                     3                                               1
   1985                     1                                               9
   1985                     2                                               1
   1986                     3                                               4
   1987                     20                                              5
   1988                     13                                              0

Now I end up with a separate series object (count_quantity_yearly_above_5) which I want to insert into the original dataframe df.

With reindex I want to reduce the number of rows so that there is only one row per unique combination of year and category, which means that each combination of year and category only appear once (in other words for each year each category is only represented once).

Apparently fill_value=0 should tell pandas count to include zero-counts.

Obviously something is very wrong with the code since all memory is consumed when running it which I suspect is due to one of these lines in the code:

count_quantity_yearly_above_5.reindex(midx, fill_value=0)

df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)

EDIT

The main problem is that I can't get count_quantity_yearly_above_5-column into the original dataframe, which probably has to do with the fact that count_quantity_yearly_above_5 is a series object. Now I am obviously not importing the series object into the original dataframe correctly. Any suggestions on how this code could be adjusted?

Running just the line (df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)) returns the error:

TypeError: ("'Series' object is not callable", 'occurred at index 0')

EDIT 2

I just figured out which line causes 100 % memory usage:

count_quantity_yearly_above_5.reindex(midx, fill_value=0)

Upvotes: 1

Views: 2054

Answers (1)

joejoemac
joejoemac

Reputation: 165

You probably want to use a groupby.

The following will return the final table you wanted. Only including columns Year, Category and count_quantity_yearly_above_5.

df.groupby(['Year', 'Category']).size().reset_index(name='count_quantity_yearly_above_5')

Here is a simlar problem with more detailed answers

Upvotes: 1

Related Questions