Reputation: 309
My dataset looks like this:
Year Month Day Category Quantity
1984 1 1 2 10.5
1984 1 1 6 3.7
1985 1 2 8 4.8
1985 2 1 3 20
1986 1 1 1 9
1986 2 1 18 12.6
1987 1 29 20 2.8
Note that every day of every month of every year contains one unique entry. In other words each day can only have one category (not several).
I am trying to count how many times each category occurred every year.
However, using count in Pandas I realized zero-counts are not included.
In other words, if a category did not take place one year it is not included. So in order to solve that I tried using: fill_value=0
(as shown in the below code).
I ended up with this (WARNING: don't run this code as is because it eats up all the memory apparently):
import pandas as pd
df = pd.read_csv("import.csv", header=0,
encoding='iso-8859-1')
midx = pd.MultiIndex.from_product([
df['Year'],
df['Category']
], names=['Year', 'Category'])
df['QuantityWithNaN'] = pd.to_numeric(df['Quantity'], errors='coerce')
count_quantity_yearly_above_5 = df[df['QuantityWithNaN'] > 5.0].groupby(['Year', 'Category'])['Quantity'].count()
count_quantity_yearly_above_5.reindex(midx, fill_value=0)
df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)
df.to_csv("export.csv",encoding='iso-8859-1')
The datatypes of the dataframe df is after running that code:
The datatypes for the dataframe that is imported from the CSV is as:
Year int64
Month int64
Day int64
Category int64
Quantity object
QuantityWithNaN float64
The end result should thus be like, but is not achieved with the code above. (The end result must not be sorted in any specific order, only important thing is that all categories appear each year):
Year Month Day Category Quantity count_quantity_yearly_above_5
1984 1 1 1 10.5 2
1984 1 1 2 3.7 7
1984 1 2 3 4.8 1
1985 2 1 1 20 9
1985 1 1 2 9 1
1986 2 1 3 12.6 4
1987 1 29 20 2.8 5
1988 10 2 0
Also for visualization and the information that matters in the end will entirely be given by the columns as follows, so zero-counts are included and there are no duplicate rows for each combination of year and category (obviously I am to lazy, and it would take a lot more space to include all categories (1-20) every year here):
Year Category count_quantity_yearly_above_5
1984 1 2
1984 2 7
1984 3 1
1985 1 9
1985 2 1
1986 3 4
1987 20 5
1988 13 0
Now I end up with a separate series object
(count_quantity_yearly_above_5
) which I want to insert into the original dataframe
df.
With reindex
I want to reduce the number of rows so that there is only one row per unique combination of year and category, which means that each combination of year and category only appear once (in other words for each year each category is only represented once).
Apparently fill_value=0
should tell pandas count
to include zero-counts.
Obviously something is very wrong with the code since all memory is consumed when running it which I suspect is due to one of these lines in the code:
count_quantity_yearly_above_5.reindex(midx, fill_value=0)
df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)
EDIT
The main problem is that I can't get count_quantity_yearly_above_5
-column into the original dataframe, which probably has to do with the fact that count_quantity_yearly_above_5
is a series object. Now I am obviously not importing the series object into the original dataframe correctly.
Any suggestions on how this code could be adjusted?
Running just the line (df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)
) returns the error:
TypeError: ("'Series' object is not callable", 'occurred at index 0')
EDIT 2
I just figured out which line causes 100 % memory usage:
count_quantity_yearly_above_5.reindex(midx, fill_value=0)
Upvotes: 1
Views: 2054
Reputation: 165
You probably want to use a groupby
.
The following will return the final table you wanted. Only including columns Year
, Category
and count_quantity_yearly_above_5
.
df.groupby(['Year', 'Category']).size().reset_index(name='count_quantity_yearly_above_5')
Here is a simlar problem with more detailed answers
Upvotes: 1